Systems and Methods for Improving Visual Search Using Summarization Feature

ABSTRACT

Systems that search databases of videos or images to identify products similar to a product depicted in a given video or image are disclosed. The content of the given video is represented by a feature vector used to measure the given video's similarity to either a video or an image. When the system is deployed to recognize particular fashion items in videos, some such videos are taken in uncontrolled settings and, as a result, may have low resolution, poor contrast, minimal focus, motion blur, or low lighting. The system addresses these shortcomings by recognizing and removing poor-quality video frames from the image recognition pipeline, associating products across video frames to form tracklets of each product, and enriching the feature representation of each item for the best retrieval results by fusing information from multiple video frames depicting the item.

BACKGROUND

Object detection from images and videos is an important computer vision research problem. Object detection from images and videos paves the way for a multitude of computer vision tasks including similar object search, object tracking, and collision avoidance for self-driving cars. Object detection performance may be affected by multiple challenges including imaging noise (motion blur, lighting variations), scale, object occlusion, self-occlusion, and appearance similarity with the background or other objects. Therefore, it is desirable to develop robust image processing systems that improve the identification of objects belonging to a particular category from other objects in the image, and that are capable of accurately determining the location of the object within the image (localization).

BRIEF DESCRIPTION OF THE DRAWINGS

Various techniques will be described with reference to the drawings, in which:

FIG. 1 shows an illustrative example of a system that presents product recommendations to a user, in an embodiment;

FIG. 2 shows an illustrative example of a data record for storing information associated with a look, in an embodiment;

FIG. 3 shows an illustrative example of a data record for storing information associated with an image, in an embodiment;

FIG. 4 shows an illustrative example of a data record for storing information associated with a product, in an embodiment;

FIG. 5 shows an illustrative example of an association between an image record and a look record, in an embodiment;

FIG. 6 shows an illustrative example of a process that, as a result of being performed by a computer system, generates a look record based on an image, in an embodiment;

FIG. 7 shows an illustrative example of an association between a look record and a set of product records, in an embodiment;

FIG. 8 shows an illustrative example of a process that, as a result of being performed by a computer system, identifies a set of products to achieve a desired look, in an embodiment;

FIG. 9 shows an illustrative example of an association between a product owned by a user and a related product that may be worn with the user's product to achieve a look, in an embodiment;

FIG. 10 shows an illustrative example of a process that, as a result of being performed by a computer system, identifies a product that may be worn with an indicated product to achieve a particular look, in an embodiment;

FIG. 11 shows an illustrative example of a process that identifies, based at least in part on a specified article of clothing, a set of additional articles that, when worn in combination with the selected article of clothing, achieve a particular look, in an embodiment;

FIG. 12 shows an illustrative example of a user interface product search system displayed on a laptop computer and a mobile device, in an embodiment;

FIG. 13 shows an illustrative example of executable instructions that install a product search user interface on a website, in an embodiment;

FIG. 14 shows an illustrative example of a user interface for identifying similar products using a pop-up dialog, in an embodiment;

FIG. 15 shows an illustrative example of a user interface for identifying similar products, in an embodiment;

FIG. 16 shows an illustrative example of a user interface for identifying a look based on a selected article of clothing, in an embodiment;

FIG. 17 shows an illustrative example of a user interface that allows the user to select a look from a plurality of looks, in an embodiment;

FIG. 18 shows an illustrative example of a user interface that allows the user to select a particular article of clothing from within a look, in an embodiment;

FIG. 19 shows an illustrative example of a desktop user interface for navigating looks and related articles of clothing, in an embodiment;

FIG. 20 shows an illustrative example of a user interface for navigating looks implemented on a mobile device, in an embodiment;

FIG. 21 shows an illustrative example of a user interface for navigating looks implemented on a web browser, in an embodiment;

FIG. 22 shows an illustrative example of a generic object detector and a hierarchical detector, in an embodiment;

FIG. 23 shows an illustrative example of a category tree representing nodes at various levels, in an embodiment;

FIG. 24 shows an illustrative example of a normalized error matrix, in an embodiment;

FIG. 25 shows an illustrative example of a hierarchical detector that can correct for missing detections from a generic detector, in an embodiment;

FIG. 26 shows an illustrative example of how a hierarchical detector suppresses sibling output in contrast to a generic detector, in an embodiment;

FIG. 27 shows an illustrative example of a graphical user interface that enables utilization of techniques described herein, in an embodiment;

FIG. 28 shows an illustrative example of a graphical user interface that enables utilization of techniques described herein, in an embodiment;

FIG. 29 shows an illustrative example of a triplet with overlaid bounding boxes, in an embodiment;

FIG. 30 shows a first portion of an illustrative example of a network design that captures both coarse-grained and fine-grained representations of fashion items in an image, in an embodiment;

FIG. 31 shows a second portion of an illustrative example of a network design that captures both coarse-grained and fine-grained representations of fashion items in an image, in an embodiment;

FIG. 32 shows an illustrative example of how batches are formed to generate triplets online, in an embodiment;

FIG. 33 shows an illustrative example of hard negative product mining steps, in an embodiment;

FIG. 34 shows an illustrative example of image and video product retrieval, in an embodiment;

FIG. 35 shows an illustrative example of a video product retrieval system that identifies one or more products from a video or image, in an embodiment;

FIG. 36 shows an illustrative example of quality head branch training, in an embodiment;

FIG. 37 shows an illustrative example of a product web page that includes product attributes, in an embodiment;

FIG. 38 shows an illustrative example of output from a detection and attribute network, in an embodiment;

FIG. 39 shows an illustrative example of a schematic of a detection and attribute network, in an embodiment;

FIG. 40 illustrates an environment in which various embodiments can be implemented; and

FIG. 41 illustrates aspects of an example environment for implementing aspects in accordance with various embodiments.

DETAILED DESCRIPTION

The current document describes an image processing system that is capable of identifying objects within images or video segments. In an embodiment, the system operates by identifying regions of an image that contain an object. In an embodiment, for each region, attributes of the object are determined, and based on the attributes, the system may identify the object or identify similar objects. In some embodiments, the system uses a tracklet to track an object through a plurality of image frames within a video segment, allowing more than one image frame to be used in object detection and thereby increasing the accuracy of the object detection.
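
Where a product appears across several frames of a tracklet, the per-frame feature vectors can be fused into a single enriched descriptor. The following is a minimal sketch of one plausible fusion strategy, assuming a per-frame quality score is available (for example, from a quality head such as that of FIG. 36); the function name, the threshold value, and the quality-weighted averaging are illustrative assumptions rather than the claimed implementation.

```python
import numpy as np

def fuse_tracklet_features(frame_features, quality_scores, quality_threshold=0.5):
    """Fuse per-frame features for one tracked product into a single
    descriptor, discarding low-quality frames (blur, low light, etc.)."""
    frame_features = np.asarray(frame_features, dtype=np.float64)  # (N, D)
    quality_scores = np.asarray(quality_scores, dtype=np.float64)  # (N,)

    keep = quality_scores >= quality_threshold
    if not keep.any():
        # Every frame is poor; fall back to the single best frame.
        keep = quality_scores == quality_scores.max()

    # Quality-weighted average, then L2-normalize for cosine-similarity search.
    weights = quality_scores[keep] / quality_scores[keep].sum()
    fused = (frame_features[keep] * weights[:, None]).sum(axis=0)
    return fused / np.linalg.norm(fused)
```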

In an embodiment, the system determines a category for each object detected. In one example, a hierarchical detector predicts a tree of categories as output. The approach learns the visual similarities between various object categories and predicts a category tree. The resulting framework significantly improves the generalization capabilities of the detector to novel objects. In some examples, the system can accommodate the addition of novel categories without obtaining new labeled data or retraining the network.

Various embodiments described herein utilize a deep learning based object detection framework and a similar object search framework that explicitly model the correlations present between various object categories. In an embodiment, an object detection framework predicts a hierarchical tree as output instead of a single category. For example, for a ‘t-shirt’ object, a detector predicts [‘top innerwear’ → ‘t-shirt’]. The upper-level category ‘top innerwear’ includes [‘blouses_shirts’, ‘tees’, ‘tank_camis’, ‘tunics’, ‘sweater’]. The hierarchical tree is estimated by analyzing the errors of an object detector which does not use any correlation between the object categories. Accordingly, techniques described herein comprise:

1. A hierarchical detection framework for the object domain.
2. A method to estimate the hierarchical/semantic tree based at least in part on directly analyzing the detection errors.
3. Using the estimated hierarchy tree to demonstrate the addition of a novel category object and performing search.

In an embodiment, the system determines regions of interest within an image by computing bounding boxes and the corresponding categories for the relevant objects using visual data. In some examples, the category prediction assumes that only one of the K total object categories is associated with each bounding box. The 1-of-K classification may be achieved by a ‘Softmax’ layer, which encourages each object category to be as far away as possible from the other object categories. However, in some examples, this process may fail to exploit the correlation information present in the object categories. For example, ‘jeans’ is closer to ‘pants’ than to ‘coat’. In an embodiment, this correlation is exploited by first predicting ‘lower body’ and then choosing one element from the ‘lower body’ category, which is a set of ‘jeans’, ‘pants’, and ‘leggings’, via hierarchical tree prediction. In some embodiments, the system improves the separation of objects belonging to a particular category from other objects, and improves the identification of the location of the object in the image.
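
A hedged sketch of that two-stage idea follows: a coarse node such as ‘lower body’ is predicted first, and the leaf is then chosen only among that node's children. The hierarchy contents come from the examples above; the function names and logit layout are illustrative assumptions.

```python
import numpy as np

# Illustrative two-level hierarchy drawn from the examples above.
HIERARCHY = {
    "lower_body": ["jeans", "pants", "leggings"],
    "top_innerwear": ["blouses_shirts", "tees", "tank_camis", "tunics", "sweater"],
}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_path(coarse_logits, leaf_logits_by_node):
    """Predict a root-to-leaf path instead of a flat 1-of-K category."""
    coarse_names = list(HIERARCHY)
    node = coarse_names[int(softmax(coarse_logits).argmax())]
    leaves = HIERARCHY[node]
    leaf = leaves[int(softmax(leaf_logits_by_node[node]).argmax())]
    return [node, leaf]

# Example: strongly 'lower body', and 'jeans' among its children.
path = predict_path(
    np.array([2.0, -1.0]),
    {"lower_body": np.array([1.5, 0.2, -0.3]), "top_innerwear": np.zeros(5)},
)
print(path)  # ['lower_body', 'jeans']
```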

In an embodiment, a hierarchical prediction framework is integrated with an object detector. In some embodiments, the generic detector can be any differentiable (e.g., any deep learning based detector) mapping f(I) → (bb, c) that takes an input image I and produces a list of bounding boxes bb and a corresponding category c for each of the bounding boxes. The hierarchical detector learns a new differentiable mapping f_h(I) → (bb, F(c)) that produces a path/flow from the root category to the leaf category F(c) for each bounding box. A differentiable mapping, in an embodiment, is a mathematical function that can be differentiated with respect to its parameters to estimate the value of those parameters from ground truth data via gradient-based optimization. In an example implementation, there are two steps involved in going from a generic detector to the hierarchical detector. The first step, in an embodiment, is to train a generic detector and estimate the category hierarchy tree as discussed below. Based on the category hierarchy, the deep learning framework is then retrained with a loss function designed to predict the hierarchical category.
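
One simple way to realize a loss "designed to predict the hierarchical category" is to sum a cross-entropy term at each level of the root-to-leaf path. The sketch below is written under that assumption; the loss actually paired with the detection framework may differ.

```python
import numpy as np

def hierarchical_loss(logits_per_level, target_path):
    """Sum of cross-entropy losses over the levels of the category path.

    logits_per_level: list of 1-D logit arrays, one per tree level
                      (root-level categories first, leaves last).
    target_path:      list of ground-truth class indices, one per level.
    """
    total = 0.0
    for logits, target in zip(logits_per_level, target_path):
        # Numerically stable log-softmax.
        log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
        total -= log_probs[target]
    return total

# Example: 2 root-level nodes, then 3 children of the chosen node.
loss = hierarchical_loss(
    [np.array([2.0, -1.0]), np.array([1.5, 0.2, -0.3])],
    target_path=[0, 0],
)
```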

To estimate the category tree, in an embodiment, one estimates the visual similarity between various categories. Techniques disclosed and suggested herein improve on conventional techniques by organizing the visually similar categories for an object detector. Much prior work has focused on using attribute-level annotations to generate an annotation tag hierarchy instead of category-level information. However, such an effort requires large amounts of additional human effort to annotate each category with information such as viewpoint, object part location, rotation, and object-specific attributes. Some examples generate an attribute-based (viewpoint, rotation, part location, etc.) hierarchical clustering for each object category to improve detection. In contrast, some embodiments disclosed herein use category-level information and generate only a single hierarchical tree for the object categories.

Example implementations of the present disclosure estimate a category hierarchy by first evaluating the errors of a generic detector trained without any consideration of distance between categories and subsequently analyzing the cross-errors generated due to visual similarity between various categories. In an embodiment, a Faster-RCNN based detector is trained and detector errors are evaluated. For instance, a false positive generated by the generic detector (the Faster-RCNN detector in the current case) can be detected, and some or all of the errors that result from visually similar categories are computed. These errors, for example, may be computed by measuring the false positives with bounding boxes having an intersection-over-union (“IOU”) ratio between 0.1 and 0.5 with another object category. In this manner, visually similar classes such as ‘shoes’ and ‘boots’ will be frequently misclassified as each other, resulting in higher cross-category false-positive errors.
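
The cross-error analysis can be sketched as follows: every false positive whose box overlaps a ground-truth box of a different category with IoU in the stated band increments a cross-category error count, and rows are then normalized (cf. the normalized error matrix of FIG. 24). The data layout and the row normalization are assumptions for illustration.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def cross_error_matrix(false_positives, ground_truths, num_classes,
                       iou_low=0.1, iou_high=0.5):
    """Count false positives overlapping a ground-truth box of a *different*
    category with IoU in (iou_low, iou_high]; high counts indicate visually
    similar category pairs.

    false_positives / ground_truths: lists of (box, class_index) tuples.
    """
    errors = np.zeros((num_classes, num_classes))
    for fp_box, fp_cls in false_positives:
        for gt_box, gt_cls in ground_truths:
            if fp_cls != gt_cls and iou_low < iou(fp_box, gt_box) <= iou_high:
                errors[fp_cls, gt_cls] += 1
    # Normalize each row so it sums to 1, yielding a normalized error matrix.
    row_sums = errors.sum(axis=1, keepdims=True)
    return np.divide(errors, row_sums,
                     out=np.zeros_like(errors), where=row_sums > 0)
```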

Many conventional techniques have focused on using attribute-level information apart from the category-specific information to perform detection for novel object categories. Some examples use attribute-level information to detect objects from novel categories. For instance, a new object category ‘horse’ is recognized as a combination of ‘legs’, ‘mammal’, and ‘animal’ categories. Attribute-based recognition requires one to learn attribute-specific classifiers and attribute-level annotation for each of the object categories. In comparison, some embodiments of the present disclosure require neither attribute annotations nor any attribute-specific classifiers. For each new category, an expected root-level category may be assigned, and subsequently a bounding box with the highest confidence score for that category may be estimated.
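
That procedure can be pictured as a scan over the detector's per-box category paths for the assigned root-level node, returning the highest-confidence box under it. This is a sketch; the detection tuple format is a hypothetical representation.

```python
def detect_novel_category(detections, assigned_root):
    """Return the highest-confidence box whose predicted root-to-leaf path
    passes through the root-level node assigned to the novel category.

    detections: list of (box, category_path, score) tuples, where
    category_path is the hierarchical detector's root-to-leaf list.
    """
    candidates = [(box, score) for box, path, score in detections
                  if assigned_root in path]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])[0]
```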

Systems operating according to various embodiments disclosed herein perform category-specific non-maximal suppression to select bounding boxes for each leaf-node category, where the bounding boxes may be unique. For all the lower-level categories, such systems may also suppress the output by considering bounding boxes from all the child nodes. In some embodiments, this helps reduce spurious lower-level category boxes whenever bounding boxes from more specific categories can be detected.
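
A sketch of that suppression logic follows, reusing the iou helper from the error-matrix sketch above. Treating a parent node's box as spurious whenever it overlaps a surviving child-category box past the same IoU threshold is an illustrative choice, not the claimed rule.

```python
def hierarchical_nms(boxes_by_category, hierarchy, iou_threshold=0.5):
    """Category-specific NMS, then suppression of lower-level (parent)
    boxes that overlap surviving boxes from any child category.

    boxes_by_category: {category: [(box, score), ...]}
    hierarchy:         {parent_category: [child_category, ...]}
    """
    def nms(items):
        items = sorted(items, key=lambda item: item[1], reverse=True)
        kept = []
        for box, score in items:
            if all(iou(box, k_box) <= iou_threshold for k_box, _ in kept):
                kept.append((box, score))
        return kept

    kept = {cat: nms(items) for cat, items in boxes_by_category.items()}
    for parent, children in hierarchy.items():
        child_boxes = [box for c in children for box, _ in kept.get(c, [])]
        kept[parent] = [(box, score) for box, score in kept.get(parent, [])
                        if all(iou(box, cb) <= iou_threshold
                               for cb in child_boxes)]
    return kept
```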

In various embodiments, a user interface on a client computer system presents product information to a user. In some examples, the client computer system is a desktop computer system, a notebook computer system, a tablet device, a cellular phone, a thin client terminal, a kiosk, or a point-of-sale device. In one example, the client computer system is a personal computer device running a web browser, and the user interface is served from a web server operated by the merchant to the web browser. The web browser renders the user interface on a display of the personal computer device, and the user interacts with the display via a virtual keyboard or touch screen. In another example, the personal computer device is a personal computer system running a web browser, and the user interacts with the user interface using a keyboard and a mouse. Information exchanged between the client computer system and the web server operated by the merchant may be exchanged over a computer network. In some embodiments, information is encrypted and transmitted over a secure sockets layer (“SSL”) or transport layer security (“TLS”) connection.

In various examples, an important consideration is whether the user is able to determine how to combine the offered product with other products to produce a desired appearance or “look.” For example, the user may wish to determine whether the offered product “goes with” other products or articles of clothing already owned by the user. In other examples, the user may wish to identify other products that may be purchased to wear with the offered product. In some situations, how the product will be used or worn to produce a desired look may be more decisive than the attractiveness of the individual product. Therefore, it is desirable to produce a system and a user interface that allow the user to easily identify related items that can be used with the offered product to produce various looks.

In an embodiment, the system provides a software development kit (“SDK”) that can be added to the web code of a retailer's website. The SDK adds functionality to the retailer's website allowing users to identify items related to products offered for sale that will produce a desired look. The added functionality allows users to feel at ease by providing information on how to wear the offered product, including style recommendations related to the offered product.

In an embodiment, the SDK visually matches a brand's social media content and lookbook photos to corresponding product pages on the merchant's website. The SDK presents a user interface that allows users to see how celebrities and ordinary people wear the products offered for sale. The system also identifies products similar to the items that people are wearing in the recommended look, so that users can compare the entire look.

In an embodiment, the visual search functionality is added to the merchant's website by adding a link to a JavaScript file to the merchant's website code. The SDK serves as a layer on top of the original website, and in general, the SDK does not interfere with how the merchant's website operates.

In an embodiment, a user accesses the merchant's website using a web browser running on the client computer system. The web browser loads the code from the merchant's website, which includes a reference to the SDK. The web browser loads executable code identified by the reference and executes it within the web browser. In some examples, the executable code is a JavaScript plug-in which is hosted on a computer system.

In an embodiment, the executable code downloaded by the SDK into the user's web browser is executed, causing the web browser to display the user interface described herein to the user. In an embodiment, the executable code also causes the web browser to contact an online service. The online service maintains a database of looks, where each look includes a list of products that, when worn together, form the associated look. In an embodiment, each look is stored in association with a set of products. In another embodiment, each product in the set of products is characterized by a set of characteristics. For example, a particular look may include a shirt, a pair of pants, and a hat. The shirt, pants, and hat may be identified as particular products that can be purchased. Alternatively, each product may be described as a set of characteristics. For example, the hat may be described as short, brown, and tweed, and the shirt may be described as white, long-sleeved, V-neck, and cotton knit.

In an embodiment, the online service is provided with a particular article of clothing in the form of a SKU, a product identifier, a set of characteristics, or an image, and the online service identifies one or more looks that include the particular article of clothing. In some embodiments, the online service identifies one or more looks that include similar articles of clothing. The online service returns the look in the form of an image, along with information regarding the individual products that are associated with the look. The online service may also include bounding box information indicating where each product is worn on the image.
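
The characteristics-based lookup can be pictured as a simple attribute comparison over a looks database. Everything below (the record shapes, the overlap score, the 0.6 threshold) is an illustrative assumption, and a SKU or product-identifier query would bypass the comparison entirely.

```python
from typing import Dict, List

def find_looks(query: Dict, looks_db: List[Dict], min_overlap: float = 0.6) -> List[Dict]:
    """Return looks containing an article whose position matches the query
    and whose attributes sufficiently overlap the query's attributes.

    query:           {"position": "top", "attributes": {"color": "white", ...}}
    looks_db entries: {"source": ..., "articles": [...], "bounding_boxes": ...}
    """
    def overlap(a: Dict, b: Dict) -> float:
        shared = set(a) & set(b)
        return sum(a[k] == b[k] for k in shared) / len(shared) if shared else 0.0

    results = []
    for look in looks_db:
        for article in look["articles"]:
            if (article["position"] == query["position"]
                    and overlap(article["attributes"], query["attributes"]) >= min_overlap):
                results.append(look)
                break  # one matching article is enough to return this look
    return results
```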

FIG. 1 shows an illustrative example of a system 100 that presents product recommendations to a user, in an embodiment. In an embodiment, the system 100 includes a Web server 102 that hosts a website. In various examples, the Web server 102 may be a computer server, server cluster, virtual computer system, computer runtime, or web hosting service. The website is a set of hypertext markup language (“HTML”) files, script files, multimedia files, extensible markup language (“XML”) files, and other files stored on computer-readable media that is accessible to the Web server 102. Executable instructions are stored on a memory of the Web server 102. The executable instructions, as a result of being executed by a processor of the Web server 102, cause the Web server 102 to serve the contents of the website over a network interface in accordance with the hypertext transport protocol (“HTTP”) or secure hypertext transport protocol (“HTTPS”). In an embodiment, the Web server 102 includes a network interface connected to the Internet.

A client computer system 104 communicates with the Web server 102 using a web browser via a computer network. In an embodiment, the client computer system 104 may be a personal computer system, a laptop computer system, a tablet computer system, a cell phone, or a handheld device that includes a processor, memory, and an interface for communicating with the Web server 102. In an embodiment, the interface may be an Ethernet interface, a Wi-Fi interface, a cellular interface, a Bluetooth interface, a fiber-optic interface, or a satellite interface that allows communication, either directly or indirectly, with the Web server 102. Using the client computer system 104, a user 106 is able to explore products for sale as well as looks that are presented by the Web server 102. In various examples, the Web server 102 recommends various products to the user 106 based on product linkages established through information maintained by the Web server 102.

In an embodiment, the Web server 102 maintains a database of style images 108, a database of product information 110, and a database of look information 112. In various examples, style images may include images or videos of celebrities, models, or persons demonstrating a particular look. The database of product information 110 may include information on where a product may be purchased, an associated designer or source, and various attributes of a product such as fabric type, color, texture, cost, and size. The database of look information 112 includes information that describes a set of articles that, when worn together, create a desired appearance. In some examples, the database of look information 112 may be used by the Web server 102 to identify articles of clothing that may be worn together to achieve a particular look, or to suggest additional products for purchase that may be combined with an already purchased product. In an embodiment, recommendations may be made by sending information describing the set of additional products from the Web server 102 to the client computer system 104 via a network.

FIG. 2 shows an illustrative example of a data record 200 for storing information associated with a look, in an embodiment. A data structure is an organization of data that specifies formatting, arrangement, and linkage between individual data fields such that a computer program is able to navigate and retrieve particular data structures and the various fields of individual data structures. A data record is a unit of data stored in accordance with a particular data structure. The data record 200 may be stored in semiconductor memory or on a disk that is accessible to the computer system. In an embodiment, a look record 202 includes a look source data field 204 and an article set 206. The look source data field 204 may include a uniform resource locator (“URL”), image identifier, video segment identifier, website address, filename, or memory pointer that identifies an image, video segment, or look book used to generate the look record 202. For example, a look record may be generated based on an image of a celebrity, and the source of the image may be identified in the look source data field 204. In another example, a look record may be generated from entries in a look book provided by a clothing manufacturer, and the look source data field 204 may identify the look book.

The article set 206 is a linked list, array, hash table, or other container structure that holds the set of article records. Each article record in the article set 206 describes an article included in the look. An article can be an article of clothing such as a skirt, shirt, shoes, blouse, hat, jewelry, handbag, watch, or wearable item. In the example illustrated in FIG. 2, the article set 206 includes a first article 208 and a second article 220. In various examples, other numbers of articles may be present in the article set 206. The first article 208 includes an article position field 210 and a set of article attributes 212. The article position field 210 describes a position in which the article is worn. For example, an article may be worn as a top, as a bottom, as a hat, as gloves, as shoes, or carried as a handbag. The set of article attributes 212 describes characteristics of the article and in an example includes a texture field 214, a color field 216, and a pattern field 218. The texture field 214 may specify a fabric type, a texture, a level of translucence, or thickness. The color field 216 may indicate a named color, a color hue, a color intensity, a color saturation, a level of transparency, a reflectivity, or another optical characteristic of the article. The pattern field 218 may describe a fabric pattern, a weave, a print design, or an image present on the article. The second article 220 includes data fields similar to those in the first article 208, including an article position field 222 and an article attribute set 224.
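
The look and article records described above map naturally onto nested data structures. The following is a minimal sketch in Python dataclass form; the field names mirror FIG. 2, but the representation itself is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Article:
    position: str                                   # e.g. "top", "bottom", "hat"
    attributes: Dict[str, str] = field(default_factory=dict)
    # e.g. {"texture": "cotton knit", "color": "white", "pattern": "solid"}

@dataclass
class LookRecord:
    source: str                                     # URL, image id, or look book reference
    articles: List[Article] = field(default_factory=list)

# A look with two articles, mirroring the first/second article fields of FIG. 2.
look = LookRecord(
    source="https://example.com/celebrity-photo.jpg",  # hypothetical source
    articles=[
        Article("top", {"color": "white", "texture": "cotton knit"}),
        Article("hat", {"color": "brown", "texture": "tweed"}),
    ],
)
```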

FIG. 3 shows an illustrative example of a data record 300 for storing information associated with an image, in an embodiment. An image record 302 includes a set of image properties 304 and information that describes an article set 306. The image record 302 may be generated to describe the contents of a digital image or a video segment. For example, if the image record 302 describes a digital image, the set of image properties 304 includes an image source 308 that identifies the image file. If the image record 302 describes a video segment, the image source 308 identifies a segment of a video file. An image subject field 310 includes information describing the subject of the image. For example, the subject may be a model, an actor, or a celebrity.

In an embodiment, the article set 306 includes one or more article records that correspond to a set of articles found within the image. The article records may be stored as an array, linked list, hash table, relational database, or other data structure. An article record 312 includes an article position 314 and a set of article attributes 316. The article position 314 describes the location of the article relative to the subject of the image. For example, the article position may indicate that the article is a hat, pants, shoes, blouse, dress, watch, or handbag. The set of article attributes 316 may include a texture, color, pattern, or other information associated with an article as described elsewhere in the present application (for example, in FIG. 2).

FIG. 4 shows an illustrative example of a data record 400 for storing information associated with a product, in an embodiment. A product is an article captured in an image or available for sale. For example, an article may be described as a large white T-shirt, and a particular product matching that article may be an ABC Corporation cotton large tee sold by retailer XYZ. In an embodiment, a product record 402 includes a product position field 404, a set of product attributes 406, and a set of availability information 408. The product position field 404 indicates how the product (such as a hat, pants, shirt, dress, shoes, or handbag) is worn (on the head, legs, torso, whole body, feet, or hand). The set of product attributes 406 contains a variety of subfields that describe attributes of the product. In an example, the set of product attributes 406 includes a texture field 410, a color field 412, and a pattern field 414. In an embodiment, the product attributes may include some or all of the attributes of an article. In some examples, product attributes may include a superset or a subset of the attributes of an article. For example, product attributes may include characteristics that are not directly observable from an image, such as a fabric blend, a fabric treatment, washing instructions, or country of origin.

In an embodiment, the set of availability information 408 includes information that describes how the product may be obtained by a user. In an embodiment, the set of availability information 408 includes a vendor field 416, a quantity field 418, a price field 420, and a URL field 422. The vendor field 416 identifies a vendor or vendors offering the product for sale. The vendor field 416 may include a vendor name, a vendor identifier, or a vendor website address. The quantity field 418 may include information describing the availability of the product, including the quantity of the product available for sale, the quantity of the product available broken down by size (for example, how many small, medium, and large), and whether the product is available for backorder. The price field 420 indicates the price of the product and may include quantity discount information, retail, and wholesale pricing. The URL field 422 may include a URL of a website at which the product may be purchased.

FIG. 5 shows an illustrative example of an association 500 between an image record and a look record, in an embodiment. An association between records may be established using a pointer, a linking record that references each of the linked records, or by establishing matching data values between the associated records. FIG. 5 illustrates an association between a set of articles detected in an image and a set of articles that make up a look. In an embodiment, the system is provided with an image in the form of a URL, filename, image file, or video segment. The system processes the image to identify a set of articles worn by a subject. For example, a picture of a celebrity may be submitted to the system to identify a set of articles worn by the celebrity. Once the articles worn by the subject of the image are identified, an associated look record can be created.

In an embodiment, an image record 502 includes a set of image properties 506 and information that describes an article set 508. The image record 502 may be generated to describe the contents of a digital image or a video segment. For example, if the image record 502 describes a digital image, the set of image properties 506 includes an image source field that identifies the image file. If the image record 502 describes a video segment, the image properties 506 identify a segment of a video file. An image subject field may include information describing the subject of the image. For example, the subject may be a model, an actor, or a celebrity.

In an embodiment, the article set 508 includes one or more article records that correspond to a set of articles found within the image. The article records may be stored as an array, linked list, hash table, relational database, or other data structure. An article record 510 includes an article position 512 and a set of article attributes 514. The article position 512 describes the location of the article relative to the subject of the image. For example, the article position (head, feet, torso, etc.) may suggest that the article is a hat, pants, shoes, blouse, dress, watch, or handbag. The set of article attributes 514 may include a texture, color, pattern, or other information associated with an article as described elsewhere in the present application (for example, in FIG. 2).

In an embodiment, a look record 504 includes a look source data field 516 and an article set 518. The look source data field 516 may include a uniform resource locator (“URL”), image identifier, video segment identifier, website address, filename, or memory pointer that identifies an image, video segment, or look book used to generate the look record 504. For example, a look record may be generated based on an image of a celebrity, and the source of the image may be identified in the look source data field 516. In another example, a look record may be generated from entries in a look book provided by a clothing manufacturer, and the look source data field 516 may identify the look book.

The article set 518 is a linked list, array, hash table, or other container structure that holds the set of article records. Each article record in the article set 518 describes an article included in the look. An article can be an article of clothing such as a skirt, shirt, shoes, blouse, hat, jewelry, handbag, watch, or wearable item. In the example illustrated in FIG. 5, the article set 518 includes an article 520. In various examples, other numbers of articles may be present in the article set 518. The article 520 includes an article position field 522 and a set of article attributes 524. The article position field 522 describes a position in which the article is worn. For example, an article may be worn as a top, as a bottom, as a hat, as gloves, as shoes, or carried as a handbag. The set of article attributes 524 describes characteristics of the article and, for example, may include a texture field, a color field, and a pattern field.

In various embodiments, the look record 504 may be used by the system to make recommendations to a user by identifying particular products that match articles in the article set 518. By identifying particular products that match the articles in the article set 518, the system helps the user identify those products that, when worn together, achieve a look similar to that captured in the image.

FIG. 6 shows an illustrative example of a process 600 that, as a result of being performed by a computer system, generates a look record based on an image, in an embodiment. The process begins at block 602 with a computer system acquiring an image of a subject. In various examples, the image may be acquired as a file name, a file identifier, a stream identifier, or a block of image data. In additional examples, the image may be acquired as a portion of a video stream or as a composite of a number of frames within a video stream. For example, the image may be specified as information that identifies a video file and a position within the video file.

In an embodiment, at block 604, the computer system identifies a set of articles worn by a subject within the image. In some embodiments, the computer system identifies the particular subject as a particular celebrity or model. In some embodiments, the computer system identifies characteristics of the subject such as male, female, youth, or infant. In some examples, the computer system identifies a plurality of subjects present in the image. In an embodiment, for at least one of the subjects, the computer system identifies a set of articles worn by the subject. As described elsewhere in the current application, articles may be articles of clothing, accessories, jewelry, handbags, or items worn by the subject. The computer system identifies a position or way in which each article is worn by the subject. In an embodiment, the computer system identifies the article as a hat, pants, dress, top, watch, handbag, necklace, bracelet, earring, pin, brooch, sash, or belt.

In an embodiment, at block 606, the computer system identifies one or more attributes for each article worn by a subject. Attributes such as those identified elsewhere in the current document may be identified. In various embodiments, the computer system identifies a texture, color, material, or finish on the article. In additional embodiments, the computer system identifies a size of the article. The size of the article may be determined based at least in part on the identity of the subject.

At block 608, the computer system generates a record of a look in accordance with the items worn by a particular subject in the image. In some embodiments, the computer system generates a look record based on the articles worn by each subject identified in the image. The look record includes source information that identifies the image, and the article information identified above. The look record may be constructed in accordance with the record structure shown in FIG. 2.
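
Blocks 602 through 608 can be strung together as a small pipeline. The sketch below reuses the Article and LookRecord dataclasses from the FIG. 2 sketch; detect_articles and extract_attributes stand in for the detection and attribute models and are hypothetical callables, not components named by the disclosure.

```python
def generate_look_record(image_source, detect_articles, extract_attributes):
    """Sketch of process 600: acquire an image, detect the worn articles,
    attribute each article, and assemble a look record.

    detect_articles(image_source)  -> [(position, image_crop), ...]
    extract_attributes(image_crop) -> {"color": ..., "texture": ..., ...}
    """
    articles = [Article(position, extract_attributes(crop))
                for position, crop in detect_articles(image_source)]  # blocks 604-606
    return LookRecord(source=image_source, articles=articles)         # block 608
```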

FIG. 7 shows an illustrative example of an association 700 between a look record and a set of product records, in an embodiment. In an embodiment, a look record can be used by the system to identify products that, when worn together, can reproduce an overall appearance or “look” associated with the look record. In an embodiment, a look record 702 includes a look source data field 710 and an article set 712. The look source data field 710 may include a uniform resource locator (“URL”), image identifier, video segment identifier, website address, filename, or memory pointer that identifies an image, video segment, or look book used to generate the look record 702. For example, a look record may be generated from entries in a look book provided by a clothing manufacturer, and the look source data field 710 may identify the source of the look book.

The article set 712 is a linked list, array, hash table, or other container structure that holds the set of article records. Each article record in the article set 712 describes an article included in the look. An article can be an article of clothing such as a skirt, shirt, shoes, blouse, hat, jewelry, handbag, watch, or wearable item. In the example illustrated in FIG. 7, the article set 712 includes a first article 714, a second article 720, and a third article 726. In various examples, other numbers of articles may be present in the article set 712. Each article includes information that describes an article position and article attributes. In the example shown, the first article 714 includes an article position field 716 and a set of article attributes 718. The second article 720 includes an article position field 722 and a set of article attributes 724. The third article 726 includes an article position field 728 and a set of article attributes 730. The article position fields describe a position in which the associated article is worn. The article attributes describe various aspects of each article as described elsewhere in the present document.

In an embodiment, the computer system identifies products matching various articles in the look record 702. In the example shown in FIG. 7, the computer system identifies a first product record 704 that matches the first article 714, a second product record 706 that matches the second article 720, and a third product record 708 that matches the third article 726. In some examples, the computer system may identify a plurality of products that match a particular article in the look record 702. Each product record includes an associated product position 732, 738, 744, product attributes 734, 740, 746, and product availability 736, 742, 748, as described elsewhere in the present document. In an embodiment, a product matches an article if the article position matches the product position and a threshold proportion of the product attributes match the attributes of the associated article. In some examples, all product attributes must match all article attributes. In another example, only selected attributes, such as color and style, must match for the product to match an article. In yet another example, a measure of similarity is determined between a product and an article, and a match is determined when the measure of similarity exceeds a threshold value. By identifying a set of products that match a set of articles in a look, the system is able to recommend products to users that, when worn together, produce a similar look. In some examples, the system uses information in the product records to direct the user to websites or merchants from which the particular products can be purchased.
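
The position-plus-threshold rule above can be written directly as a predicate. In this sketch (reusing the Article dataclass from the FIG. 2 sketch) the proportion is taken over the product's own attributes, and the 0.8 default is an arbitrary illustrative threshold; a similarity-score variant would replace the exact-equality test.

```python
def matches(product, article, threshold=0.8):
    """True when positions agree and at least `threshold` of the product's
    attributes exactly match the article's attributes.

    product: {"position": str, "attributes": {name: value, ...}}
    article: Article (see the FIG. 2 sketch)
    """
    if product["position"] != article.position:
        return False
    attrs = product["attributes"]
    if not attrs:
        return False
    agreement = sum(article.attributes.get(k) == v for k, v in attrs.items()) / len(attrs)
    return agreement >= threshold
```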

FIG. 8 shows an illustrative example of a process 800 that, as a result of being performed by a computer system, identifies a set of products to achieve a desired look, in an embodiment. In an embodiment, the process begins at block 802 with the computer system identifying a look desired by a user. The look may be identified by selecting an image from which a look is generated, by selecting a look record for which a look has already been generated or otherwise acquired, or by supplying an image or video segment from which a look record can be generated.

At block 804, the computer system identifies the attributes of the articles present in the selected look. In various examples, the look may include a plurality of articles where each article has a set of attributes as described above. At block 806, the system searches a product database to identify products having attributes that match the articles in the selected look. In some embodiments, a product database is specified to limit the search to products from a given manufacturer or available from a particular merchant website. In some implementations, matching products have all of the attributes of an article in the look. In another implementation, matching products have a threshold percentage of the attributes of an article in the look.

At block 808, the computer system presents the identified products to the user. The products may be presented in the form of a webpage having graphical user interface elements as shown and described in the present document. In some examples, the user may be directed to similar looks to identify additional products.

FIG. 9 shows an illustrative example of an association 900 between a product owned by a user and a related product that may be worn with the user's product to achieve a look, in an embodiment. In an embodiment, a first product record 902 is used to identify a look record 904, which in turn is used to identify a second product record 906. The first product record 902 holds information that represents a product selected by the user. In some examples, the product is a product in a cart of a website. In another example, the product is a product previously purchased by the user. In yet another example, the product is a product currently owned by the user. The first product record includes a product position field 908, a set of product attributes 910, and product availability information 912. The product position field 908 and the set of product attributes 910 are used to identify the look record 904 based on the presence of an article that matches the attributes and position of the first product record 902. In some implementations, a plurality of look records may be identified based on the presence of matching articles.

In an embodiment, the look record 904 includes a look source field 914 and a set of articles 916. In the example shown in FIG. 9, the set of articles 916 includes a first article 917, a second article 921, and a third article 925. The first article 917 includes an article position field 918 and a set of article attributes 920. The second article 921 includes an article position field 922 and a set of article attributes 924. The third article 925 includes an article position field 926 and a set of article attributes 928.

In the example illustrated in FIG. 9, the computer system identifies that the attributes in the first product record 902 match the article attributes 928 of the third article 925. As a result of the presence of the matching article, the computer system examines the other articles in the set of articles 916 and searches for products matching the attributes of each article in the set of articles 916. In the example shown in FIG. 9, the computer system identifies the second product record 906, which has a product position field 930, a set of product attributes 932, and a set of product availability information 934, and determines that the product attributes 932 and product position field 930 match the corresponding article position field 918 and article attributes 920 of the first article 917. In an embodiment, the computer system recommends the product represented by the second product record 906 as one that can be worn with the product associated with the first product record 902 to achieve the look represented by the look record 904.

FIG. 10 shows an illustrative example of a process 1000 that, as a result of being performed by a computer system, identifies a product that may be worn with an indicated product to achieve a particular look. In an embodiment, the process begins at block 1002 with the computer system identifying a product owned by a user. In some examples, the computer system searches a purchase history of the user and identifies the product as one that has previously been purchased by the user. In another implementation, the product may be a product in an electronic shopping cart of a website. At block 1004, the computer system determines the attributes of the identified product, such as the color, texture, pattern, and position of the product when worn by the user. In some implementations, the attributes are determined based on an image of the product. In other implementations, the attributes are retrieved from a product database provided by the manufacturer or retailer.

In an embodiment, at block 1006, the computer system identifies a look that includes a product that matches the identified product. In some implementations, the computer system identifies look records from a database of look records that have a sufficient number of matching attributes with the identified product. In another implementation, the computer system identifies look records that contain a matching product. At block 1008, the computer system searches the identified look records and identifies additional articles in those look records. For each additional article in the identified look records, the computer system identifies the attributes of those articles and, at block 1010, identifies products from a product database having a sufficient set of matching attributes for those articles. In this way, in some examples, the system identifies products that, when worn with the identified product, “go together” or produce the “look” associated with the linking look record.
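
Blocks 1006 through 1010 can be sketched as a single pass over the look database, reusing the matches() predicate from the FIG. 7 sketch and the LookRecord/Article dataclasses from the FIG. 2 sketch: find looks anchored by the owned product, then gather catalog products matching each remaining article.

```python
def recommend_for_product(owned_product, looks_db, product_db, threshold=0.8):
    """Sketch of blocks 1006-1010: identify looks containing an article that
    matches the owned product, then recommend products matching the looks'
    other articles."""
    recommendations = []
    for look in looks_db:
        anchors = [a for a in look.articles if matches(owned_product, a, threshold)]
        if not anchors:
            continue  # this look does not contain the owned product
        for article in look.articles:
            if article in anchors:
                continue  # skip the article the user already owns
            recommendations.extend(
                p for p in product_db if matches(p, article, threshold))
    return recommendations
```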

At block 1012, the system presents the identified products as recommendations to the user. In some implementations, the recommendations may be presented along with the look so that the user can visualize how the articles may be worn together to produce the linking look.

FIG. 11 shows an illustrative example of a process that identifies, based at least in part on a specified article of clothing, a set of additional articles that, when worn in combination with the selected article of clothing, achieve a particular look, in an embodiment. While viewing a website, a user identifies a particular product such as a shirt, as indicated in FIG. 11. In order to view looks that are relevant to the particular product, the user is able to click on an icon, button, or other UI element that signals the SDK to find related looks. Information identifying a product is sent from the user's web browser to an online service. In some embodiments, the information is an image of the product. In other embodiments, the information is a SKU, product identifier, or list of product characteristics.

The online service receives the identifying information and uses it to identify one or more associated looks. In some embodiments, associated looks are identified as looks that include the identified product. In another embodiment, associated looks are identified as looks that include a product similar to the identified product. The online service returns look information to the web browser. The look information includes an image of the look, a list of products associated with a look, and a bounding box identifying each associated product in the image of the look.

Upon receiving the information identifying the look, the executable code running on the browser displays the look and highlights the products that are associated with the look. In some examples, each product associated with a look is surrounded by a bounding box. By selecting a bounding box, the user is presented with an image of the associated product. In some examples, the user is presented with additional information about the associated product and may also be presented with an option to purchase the associated product. In some embodiments, the user interface allows the user to explore products similar to a selected product. In this way, users may be provided with the matching products that are associated with a look, as well as similar products that may be used to achieve a similar look.

In various embodiments, the system attempts to identify, from a specified set of catalogs, products that are present within a particular look, based at least in part on a set of identified characteristics of each product in the look. If the system is unable to find a product matching a particular set of product characteristics, the system will attempt to identify the most similar product from the set of catalogs. The system presents product images for the identified products to the user. If the user selects a product image, the system identifies one or more similar products from the available catalogs, and the similar products are presented to the user in order of their similarity to the selected product. In some embodiments, the available sources of product information may be limited to a particular set of catalogs selected by the user hosting the SDK. In some examples, results may be sorted so that similar products from a preferred catalog are presented higher in the search results.

In an embodiment, the system may be adapted to identify articles of clothing that may be worn in combination with other articles of clothing to produce a desired look or overall appearance. In an embodiment, a user selects an article of clothing such as a shirt, dress, pants, shoes, watch, handbag, jewelry, or accessory. In various embodiments, the article may be selected from a web page, a digital image, or even a video stream. In an embodiment, the system identifies one or more looks that contain the selected article, or one or more looks that contain an article similar to the selected article. A look is a collection of articles that, when worn together, create a particular overall appearance. Looks may be ranked in accordance with a preference of the user, a score assigned by an influencer, a popularity measure, a style tag, a celebrity identity, or another measure. In some examples, the user interface allows the user to navigate a plurality of looks to identify a desired overall appearance. In some examples, the system allows the user to select a look, and in response, the user interface presents associated articles of clothing that, when worn together, produce the selected look. In some embodiments, the user interface identifies similar articles of clothing that may be combined to produce the selected look.

FIG. 12 shows an illustrative example of a user interface product search system displayed on a laptop computer and mobile device, in an embodiment. In various embodiments, the SDK may be applied to retailer websites, social media websites, and browser extensions. Platforms that implement the SDK may be accessed from mobile devices or desktop devices.

FIG. 13 shows an illustrative example of executable instructions that install a product search user interface on a website, in an embodiment. In one example, the SDK is installed by adding the lines of code shown to a webpage on a merchant website. The SDK may be served from a variety of locations including the merchant's website itself or from a third party. The SDK may be served from various websites including third-party web platforms.

The website owner can customize the design completely using cascadingstyle sheets (“CSS”) within their own website code.

FIG. 14 shows an illustrative example of a user interface for identifying similar products using a pop-up dialog, in an embodiment. In an example shown in FIG. 14, an icon in the left-hand panel is clicked to bring up the pop-up dialog showing the product and similar products. Clicking on the icon generates a call to the application programming interface, and the identity of the product is communicated to an online service. In some embodiments, the identity of the product is communicated in the form of an image. In other embodiments, the identity of the product is communicated in the form of a product identifier or list of product characteristics. The online service identifies similar products, and information describing the similar products, including images of the similar products, is returned to the SDK running on the browser. The SDK displays the center dialog showing the product and the similar products. In some embodiments, bounding boxes appear indicating an identified product. By swiping left on the returned products, the SDK presents a sequence of similar products. By scrolling up and down, the user can see different categories of similar items. For example, by scrolling up and down the user can see similar tops, or similar shoes. In the example shown in FIG. 14, the bounding boxes have a color that matches the color bar underneath each similar product.

FIG. 15 shows an illustrative example of a user interface for identifying similar products, in an embodiment. In an embodiment, when the user selects a product, information identifying the product is sent to an online service. The online service processes the image and identifies one or more products, each of which is surrounded by a colored bounding box. The image and information identifying the bounding box are returned to the client.

When the user clicks on a bounding box, other bounding boxes are muted to indicate selection of the bounding box. Products matching the selected product (that is associated with the selected bounding box) are highlighted in the bottom portion of the pop-up dialog.

In some examples, an arrow pointing to the right appears as indicated in the dialog on the right half of FIG. 15. By swiping across the product image, the SDK receives information that identifies the product, and the online service identifies looks that are associated with the product. When a user selects a product on the similar-products pop-up, the user is led to the product page of the product being clicked.

FIG. 16 shows an illustrative example of a user interface for identifying a look based on a selected article of clothing, in an embodiment. In one example, the user swipes over a search image or clicks on an arrow at the edge of the image to generate a signal that causes the SDK to provide looks that are associated with the item shown. In some embodiments, the SDK produces looks that are based on celebrity photos. In other embodiments, the SDK produces looks that are based on Instagram pages. In another embodiment, the SDK identifies looks from a stylebook or Instagram feed of a retailer or brand. In some implementations, the system produces a lookbook, which is a collection of looks for a particular product.

When viewing a particular look, arrows at the edges of the look image allow the user to navigate back to the product page (by clicking left or swiping right) or forward to view additional looks (by clicking right or swiping left). In some examples, a thumbnail of the original product photo appears below the look, and clicking on the photo of the product will navigate back to the product page. In some examples, a similar-product pop-up displays similar items to those detected in the current photo.

FIG. 17 shows an illustrative example of a user interface that allows the user to select a look from a plurality of looks, in an embodiment. For example, using the user interface illustrated in FIG. 17, the user is able to swipe on the picture to select between various looks. Clicking the right arrow or swiping left advances to the next look, and clicking the left arrow or swiping right returns to the previous look. In some implementations, the sequence of looks is transmitted to the browser from the online service, and the selection occurs between stored looks within the client software. In other implementations, swiping left or right requests a next look or previous look from the server, and the server provides information on the next or previous look as requested.

In various implementations, the user interface provides a way for the user to view products associated with the current look. In the example shown in FIG. 17, the user scrolls up to see similar products that are detected and matched from the current look image.

In an embodiment, a thumbnail of the product used to identify the look is shown in the upper left corner of the look image. By selecting the thumbnail, the user is returned to the product screen for the product.

FIG. 18 shows an illustrative example of a user interface that allows the user to select a particular article of clothing from within a look, in an embodiment. In one example, the user is able to select individual products from the look photo. Individual products of the look photo are highlighted by a bounding box. By selecting a bounding box, information identifying a product is sent to the online service, and the online service identifies a set of looks associated with the product.

Upon selecting the product's bounding box, the thumbnail associated with the previous product is removed, and an arrow pointing to the right appears. By clicking the arrow or swiping, information identifying the product is sent to the online service, and the online service returns a set of looks for the selected product (a lookbook). In this way, style recommendations can be acquired for any particular product present in a look.

FIG. 19 shows an illustrative example of a desktop user interface for navigating looks and related articles of clothing, in an embodiment. In the example shown in FIG. 19, a browser window displays a user interface for a particular look. An image of the look is shown on the left part of the page, and bounding boxes are placed around each product identified in the image. By selecting a particular bounding box, the user can be shown a set of similar products on the right side of the page.

In various examples, application dialogs and pop-up windows size responsively to the browser window. The searched image is displayed on the left and the results on the right. The user can use the mouse to scroll up and down to explore the results.

The user can click a bounding box to start viewing a lookbook of that item.

FIG. 20 shows an illustrative example of a user interface for navigating looks implemented on a mobile device, in an embodiment. FIG. 20 illustrates a mobile device implementing the system. The mobile device may be a cellular phone, tablet computer, handheld device, or other mobile device. In one embodiment, the mobile device includes a camera. The user is able to take a picture with the camera, and the resulting image is displayed on the screen of the mobile device. An icon appears in the lower right corner of the image indicating that the image may be used to identify a product or look. When the user clicks on the icon, the image is uploaded to an online service that identifies one or more products in the image. The service identifies the particular products and characteristics of the products in the image. In an embodiment, the online service returns information to the mobile device that allows the application to create bounding boxes around each product in the image.

Once bounding boxes are added to the image, the user may select a bounding box to request additional information. In one embodiment, the selection information is returned to the online service, and the online service provides information that identifies the product and, optionally, similar products. Images of the product and similar products are transferred from the online service to the mobile device, where they are displayed to the user on the display screen. The user can either view a plurality of similar products, or select a particular product and explore additional looks that use that particular product.

In some examples, the user may start from an image on a retailer's website, from a social media site, or from a photo-sharing site or service.

FIG. 21 shows an illustrative example of a user interface for navigating looks implemented on a web browser, in an embodiment. In an embodiment, the SDK runs on a personal computer system running a browser. The embodiment shown in FIG. 21 may be implemented using a personal computer, a laptop computer, or a tablet computer running a browser.

FIG. 22 shows an illustrative example of a generic object detector and a hierarchical detector, in an embodiment. The hierarchical detector predicts a tree of categories as output, compared to the generic detector that outputs a single category for each bounding box. In an embodiment, clothing product detection from images and videos paves the way for visual fashion understanding. Clothing detection allows for retrieving similar clothing items, organizing fashion photos, artificial-intelligence-powered shopping assistants, and automatic labeling of large catalogues. Training a deep learning based clothing detector requires pre-defined categories (dress, pants, etc.) and a high volume of annotated image data for each category. However, fashion evolves, and new categories are constantly introduced in the marketplace. For example, consider the case of jeggings, which are a combination of jeans and leggings. Retraining a network to handle the jegging category may involve adding annotated data specific to the jegging class and subsequently relearning the weights for the deep network. Described herein is a novel method that can handle novel category detection without the need to obtain new labeled data or retrain the network. The approach learns the visual similarities between various clothing categories and predicts a tree of categories. The resulting framework significantly improves the generalization capabilities of the detector to novel clothing products.

In an embodiment, object detection from images and videos is an important computer vision research problem. Object detection from images and videos enables selection of the relevant region of interest for a specific category, paving the way for a multitude of computer vision tasks including similar object search, object tracking, and collision avoidance for self-driving cars. Object detection performance may be affected by multiple challenges including imaging noise (motion blur, lighting variations), scale, object occlusion, self-occlusion, and appearance similarity with the background or other objects. In some embodiments, the focus of object detection is to improve separation of objects belonging to a particular category from other objects, and localization of the object in the image. In some examples, going straight from images to object locations and their corresponding categories loses the correlation between multiple categories. In some examples, the resulting methods may have a larger number of false positives because of classification error between similar classes. Furthermore, in some examples, addition of a novel object category may require re-training of the object detector.

Techniques described herein relate to a deep learning based object detection and similar object search framework that explicitly models the correlations present between various object categories. In an embodiment, an object detection framework predicts a hierarchical tree as output instead of a single category. For example, for a ‘t-shirt’ object, a detector predicts [‘top innerwear’ → ‘t-shirt’]. The upper-level category ‘top innerwear’ includes [‘blouses_shirts’, ‘tees’, ‘tank_camis’, ‘tunics’, ‘sweater’]. The hierarchical tree is estimated by analyzing the errors of an object detector which does not use any correlation between the object categories. Accordingly, techniques described herein comprise:

1.  A hierarchical detection framework for the clothing domain.
2.  A method to estimate the hierarchical/semantic tree based at least in part on directly analyzing the detection errors.
3.  Use of the estimated hierarchy tree to demonstrate addition of a novel object category and to perform search.

In an embodiment, object detection computes bounding boxes and the corresponding categories for all the relevant objects using visual data. The category prediction often assumes that only one of the K total object categories is associated with each bounding box. The 1-of-K classification is often achieved by a ‘Softmax’ layer, which encourages each object category to be as far away as possible from all the other object categories. However, this process fails to exploit the correlation information present in the object categories. For example, ‘jeans’ is closer to ‘pants’ than to ‘coat’. In an embodiment, exploitation of this correlation is accomplished by first predicting ‘lower body’ and then choosing one element from the ‘lower body’ category, which is the set of ‘jeans’, ‘pants’, and ‘leggings’, via hierarchical tree prediction.

In an embodiment, a hierarchical prediction framework is integrated with an object detector. FIG. 22 shows the changes between the generic object detector and an object detector in accordance with an embodiment. In some embodiments, the generic detector can be any differentiable mapping (e.g., any deep learning based detector) f(I) → (bb, c) that takes an input image I and produces a list of bounding boxes bb and a corresponding category c for each of the bounding boxes. The hierarchical detector learns a new differentiable mapping f_h(I) → (bb, F(c)) that produces a path/flow from the root category to the leaf category F(c) for each bounding box. A differentiable mapping, in an embodiment, is a mathematical function that can be differentiated with respect to its parameters to estimate the value of those parameters from ground truth data via gradient-based optimization.

FIG. 23 shows an illustrative example of a category tree representing nodes at various levels, in an embodiment. In an example implementation, there are two steps involved in going from a generic detector to the hierarchical detector. The first step, in an embodiment, is to train a generic detector and estimate the category hierarchy tree as discussed below. Based on the category hierarchy, the deep learning framework is then retrained with a loss function designed to predict the hierarchical category, as detailed below.

As an illustrative example, for the remainder of this disclosure, the ‘Softmax’ function will be used to predict the category c by choosing the category with the highest probability. It may be noted, however, that one with ordinary skill in the art would recognize other functions that can be used instead of or in addition to the ‘Softmax’ function. Other functions that can be used include, but are not limited to, any function whose range is positive; examples are the modulus function (|x|) and the squared function (x²). To go from these values to probability values, one may divide the function value for each category by the sum across all the categories. If a generic detector does not predict a probability score for each category, in an embodiment, the ‘Softmax’ function (or another such function) is used to convert raw scores to a relative probability measure.
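
As a minimal illustration of the score-to-probability conversion described above, the following Python sketch normalizes raw per-category scores with the exponential used by ‘Softmax’ or with any other positive-range function such as |x| or x²; the function and score values here are illustrative.

import numpy as np

def scores_to_probabilities(scores, positive_fn=np.exp):
    """Map raw category scores to probabilities by applying a
    positive-range function and dividing by the sum across categories."""
    values = positive_fn(np.asarray(scores, dtype=float))
    return values / values.sum()

raw_scores = [2.0, -1.0, 0.5]
print(scores_to_probabilities(raw_scores))             # 'Softmax'
print(scores_to_probabilities(raw_scores, np.abs))     # |x| variant
print(scores_to_probabilities(raw_scores, np.square))  # x^2 variant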

In an embodiment, a directed graph is generated from the tree. The directed graph underlying the tree is used for predicting a tree/path from the root node to the leaf node for categories. Let T represent the entire tree, consisting of all the categories as nodes and the hierarchical relationships as directed edges from parent nodes to children nodes. The terms n, s(n), p(n), and F(n) denote a node, the sibling set of a node, the parent of a node, and the path from the root node to a leaf node, respectively. Consider the dummy directed graph shown in FIG. 23. In this example, all the nodes belonging to ‘Level 0’ are denoted as root nodes since they do not have any parents. The sibling set s(n) denotes all the nodes that are on the same level and have a common parent. For example, s(1)=1, 2, 3 and s(6)=4, 5, 6. The path from the root to a leaf node includes all the nodes that lie on the way from a ‘Level 0’ node to the leaf node. For example, F(9)=1, 6, 9 and F(2)=2.
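
The following Python sketch illustrates this bookkeeping, assuming the node layout of FIG. 23 (nodes 1-3 at ‘Level 0’, nodes 4-6 as children of node 1, and node 9 as a child of node 6; the exact edges are assumptions for illustration).

# Illustrative parent-of relation for the FIG. 23 dummy graph.
parent = {1: None, 2: None, 3: None, 4: 1, 5: 1, 6: 1, 9: 6}

def siblings(n):
    """s(n): nodes on the same level that share n's parent (includes n)."""
    return sorted(m for m in parent if parent[m] == parent[n])

def path_from_root(n):
    """F(n): all nodes on the way from a 'Level 0' node down to n."""
    path = []
    while n is not None:
        path.append(n)
        n = parent[n]
    return path[::-1]

print(siblings(1))        # [1, 2, 3], i.e. s(1) = 1, 2, 3
print(siblings(6))        # [4, 5, 6], i.e. s(6) = 4, 5, 6
print(path_from_root(9))  # [1, 6, 9], i.e. F(9) = 1, 6, 9
print(path_from_root(2))  # [2],       i.e. F(2) = 2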

The estimated probability of any node n (or the category probability for a bounding box) given an image I is represented by P(n|I). Using the underlying graph, this probability can also be expressed as a series of conditional probabilities over the path from the root node to the leaf node:

$P(n \mid I) = P(l_0 \mid I)\,P(l_1 \mid l_0) \cdots P(n \mid l_{q-1})$  (1)

where q is the total number of nodes along the path and all the nodes in the conditional probability computation belong to the path from the root to the leaf node, F(n) = (l₀, l₁, . . . , l_{q−1}, n). In an embodiment, the ‘Softmax’ layer is used to estimate the probability of each node. The nodes are represented in a single vector, and the last fully-connected (FC) layer predicts scores for all of the nodes. The underlying structure of the category tree is used to obtain the probability for nodes at each level. For example, for a zeroth-level node, one can calculate the probability as

$P(l_0 \mid I) = \frac{\exp c_0}{\sum_{c_i \in s(l_0)} \exp c_i}$  (2)

where the ‘Softmax’ is only computed with respect to the sibling nodes. This encourages competition (1-of-K classification) only amongst the siblings. In an embodiment, the category estimator will first try to separate between major categories such as ‘upper body’, ‘lower body’, and ‘footwear’, subsequently estimate a finer category within each of those categories, and so on.
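
A minimal Python sketch of Equations 1 and 2 follows, reusing the illustrative FIG. 23 tree layout and assuming arbitrary raw scores for each node; each factor is a softmax normalized only over the node's sibling set.

import math

# Illustrative tree (as in the earlier sketch) and assumed raw node scores.
parent = {1: None, 2: None, 3: None, 4: 1, 5: 1, 6: 1, 9: 6}
scores = {1: 2.0, 2: 0.5, 3: -1.0, 4: 0.1, 5: 0.3, 6: 1.2, 9: 0.7}

def node_probability(n):
    """P(n|I): product of sibling-level softmax terms along F(n)."""
    p = 1.0
    while n is not None:
        sibs = [m for m in parent if parent[m] == parent[n]]
        denom = sum(math.exp(scores[m]) for m in sibs)
        p *= math.exp(scores[n]) / denom  # Eq. 2 at this level
        n = parent[n]                     # move one level toward the root
    return p

print(node_probability(9))  # P(1|I) * P(6|1) * P(9|6), per Eq. 1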

To adapt a generic detector to a hierarchical detector, the cross-entropy between the predicted distribution in Equation 1 and the ground-truth annotation is used:

$L(I) = -\sum_{x} q(x \mid I)\,\log P(x \mid I)$  (3)

where x ranges over the individual elements of the vector representing all the categories, and P(x|I) and q(x|I) denote the category probability and annotation vector for image I, respectively. Both of these vectors are of dimension |T|, which, in this example, is also the total number of categories. The generic detector has just a single active element (a single category) in the annotation vector but, in some implementations, may have multiple activations to account for all the labels from the root node to the leaf node.

In an embodiment, the backward propagation step is modified to learn parameters of the deep neural network that can predict hierarchical categories. The usage of sibling-level ‘Softmax’ and the underlying graph structure induces a multiplier factor for each category. Consider the graph in FIG. 23, and assume that an input image has category 9. The presence of category 9 also indicates the presence of the categories along the path from leaf to root (6, 1). The loss represented in Equation 3, in an embodiment, thus has at least three different active labels (1, 6, 9). The loss for this image can be written as

$L(I) = -\left(\log P(1 \mid I) + \log P(6 \mid I) + \log P(9 \mid I)\right) = -\left(\log P(1 \mid I) + \log P(6 \mid 1)P(1 \mid I) + \log P(9 \mid 6)P(6 \mid 1)P(1 \mid I)\right) = -\left(3\log P(1 \mid I) + 2\log P(6 \mid 1) + \log P(9 \mid 6)\right)$  (4)

Equation 4 demonstrates that, to perform back-propagation to learn the weights of the network, a multiplier factor for each of the nodes may be used. The above example can be generalized, and an algorithm to estimate the multiplier factor for each node is presented in Example Algorithm 1. Intuitively, in some implementations, the loss function requires the deep neural network to ensure representation of the various paths to a leaf node, leading to representation of hierarchical information. In one example, given the category tree T and ground truth annotation q(x|I) for an image I, the leaf node is estimated, and subsequently the level-distance from the leaf node is assigned as the multiplier factor for all the nodes. The multiplier factor is zero for all nodes at levels deeper than the annotated leaf node.

Data: q(x|I), T
Result: Multiplier factor m(n) for all nodes
Initialize m(n) = 0 ∀ n ∈ T;
Find the leaf node l_q from q(x|I);
// Traverse over all nodes in the path from leaf to root
for l_i = l_q down to l_0 do
    m(n) = (q − i + 1) ∀ n ∈ s(l_i);
end

Example Algorithm 1: Multiplier Factor Estimation for Each Node
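
A minimal Python sketch of Example Algorithm 1 follows, again assuming the illustrative FIG. 23 tree layout; for ground-truth category 9 it reproduces the coefficients 3, 2, 1 of Equation 4 on the path nodes.

# Illustrative parent-of relation for the FIG. 23 dummy graph.
parent = {1: None, 2: None, 3: None, 4: 1, 5: 1, 6: 1, 9: 6}

def multiplier_factors(leaf):
    """m(n): level distance from the leaf plus one, assigned across the
    sibling set at each level on the leaf-to-root path; zero elsewhere."""
    path, node = [], leaf
    while node is not None:            # walk from the leaf to the root
        path.append(node)
        node = parent[node]
    m = {n: 0 for n in parent}         # initialize m(n) = 0 for all nodes
    for distance, node in enumerate(path):     # leaf is at distance 0
        for sib in (x for x in parent if parent[x] == parent[node]):
            m[sib] = distance + 1
    return m

print(multiplier_factors(9))  # {1: 3, 2: 3, 3: 3, 4: 2, 5: 2, 6: 2, 9: 1}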

To estimate the category tree T, in an embodiment, one estimates the visual similarity between various categories. Techniques disclosed and suggested herein improve on conventional techniques by organizing the visually similar categories for an object detector. Much prior work has focused on using attribute-level annotations to generate an annotation tag hierarchy instead of category-level information. However, such an effort requires large amounts of additional human effort to annotate each category with information such as viewpoint, object part location, rotation, and object-specific attributes. Some examples generate an attribute-based (viewpoint, rotation, part location, etc.) hierarchical clustering for each object category to improve detection. In contrast, some embodiments disclosed herein use category-level information and generate only a single hierarchical tree for all the object categories.

Example implementations of the present disclosure estimate a category hierarchy by first evaluating the errors of a generic detector trained without any consideration of distance between categories and subsequently analyzing the cross-errors generated due to visual similarity between various categories. In an embodiment, a Faster-RCNN based detector is trained and the detector errors are evaluated. For instance, a false positive generated by the generic detector (a Faster-RCNN detector in the current case) can be detected, and some or all of the errors that result from visually similar categories are computed. These errors, for example, may be computed by measuring all the false positives with bounding boxes having an intersection-over-union (“IOU”) ratio between 0.1 and 0.5 with another object category. In this manner, visually similar classes such as ‘shoes’ and ‘boots’ will be frequently misclassified with each other, resulting in higher cross-category false positive errors.
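
For illustration, the following Python sketch shows the intersection-over-union test that decides whether a false positive counts toward this cross-category error analysis; the (x1, y1, x2, y2) box coordinates are illustrative.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_cross_category_error(pred_box, other_category_box):
    # Count a false positive when its overlap with a box of another
    # category falls in the 0.1-0.5 IOU band described above.
    return 0.1 <= iou(pred_box, other_category_box) <= 0.5

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 0.333...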

In an embodiment, a cross-category false positive matrix D (Size(D) = J×(J+1)) is computed, where J denotes the total number of categories in the dataset. In this example, the second dimension is one higher than the first dimension to account for false positives that only intersect with the background. The diagonal entries of the matrix D, in this example, reflect the false positives resulting from poor localization and are ignored for the current analysis, although they may be used in some implementations. Example Algorithm 2 describes the process used to obtain the category tree. Using the matrix D and a predefined threshold τ, the sets of categories that are similar to each other are estimated. This results in disjoint groups of categories. All the sets in T with more than one element are given new category names, and all the elements of each such set are assigned as children of the newly defined category. The above process readily generates a 2-level tree of categories.

Data: C, τ
Result: T
Initialize T = ∅;
for i = 1 to J do
    for j = 1 to J do
        if C[i][j] ≥ τ then
            if i ∈ n or j ∈ n for some n ∈ T then
                // Add to the existing group
                n = n ∪ {i, j};
            else
                // Start a new group
                n = {i, j};
                T = T ∪ {n};
            end
        end
    end
end

Example Algorithm 2: Generating Visually Similar Groups fromCross-Category False Positive Error Matrix
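
A minimal Python sketch of Example Algorithm 2 follows; the toy 4-category error matrix and threshold are illustrative.

def similar_groups(C, tau):
    """Group categories whose cross-category false positive count in C
    meets the threshold tau (Example Algorithm 2)."""
    groups = []
    J = len(C)
    for i in range(J):
        for j in range(J):
            if i != j and C[i][j] >= tau:
                # Add to an existing group containing i or j, if any
                for group in groups:
                    if i in group or j in group:
                        group.update({i, j})
                        break
                else:
                    groups.append({i, j})  # start a new group
    return groups

# Toy error matrix: categories 0 and 1 are frequently confused.
C = [[0, 9, 0, 0],
     [8, 0, 0, 0],
     [0, 0, 0, 2],
     [0, 0, 1, 0]]
print(similar_groups(C, tau=5))  # [{0, 1}]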

Some techniques focus on using attribute-level information apart from the category-specific information to perform detection for novel object categories. Some examples use attribute-level information to detect objects from novel categories. For instance, a new object category ‘horse’ is recognized as a combination of the ‘legs’, ‘mammal’, and ‘animal’ categories. Attribute-based recognition requires one to learn attribute-specific classifiers and attribute-level annotations for each of the object categories. In comparison, some embodiments of the present disclosure require neither attribute annotations nor any attribute-specific classifiers. For each new category, an expected root-level category may be assigned, and subsequently a bounding box with the highest confidence score for that category may be estimated.

Systems operating according to various embodiments disclosed herein perform category-specific non-maximal suppression to select bounding boxes for each leaf-node category, where the bounding boxes may be unique. For all the lower-level categories, such systems may also suppress the output by considering bounding boxes from all the children nodes. In some embodiments, this helps reduce spurious lower-level category boxes whenever bounding boxes from more specific categories can be detected.

In various implementations, the detector serves two purposes for similar object matching: region of interest detection and category identification. Region of interest detection, in an embodiment, is used to help crop the image to contain only the relevant object. Category identification, on the other hand, in an embodiment, is used to narrow down the number of clothing images to be searched. For example, if the detector detects a ‘dress’ object, then the search can be limited to the ‘dress’ clothing database. In the case of a novel category, since there is only a root-level node category, the search for similar clothing items can be performed among the images of the children of the root-level node.

To test this formulation, a large dataset of 97,321 images from various fashion-relevant websites, such as ‘www.modcloth.com’ and ‘www.renttherunway.com’, was collected. For all the images, human annotations for all the fashion-relevant items were obtained, resulting in a total of 404,891 bounding boxes across 43 different categories. All the categories that have fewer than 400 bounding boxes were ignored for training the object detector, resulting in 26 valid categories, although different parameters may be used. The statistics of the dataset are provided in Table 1. The dataset was split into training and testing sets 80-20. All the detectors were trained using only the training data, and their performance is evaluated using the same test set.

In these examples, the open-source deep learning framework CAFFE may be used. For learning, stochastic gradient descent is used with a base learning rate of 0.001, which is halved every 50,000 iterations, a momentum of 0.9, and a weight decay of 0.0005. For both detectors, all the same hyperparameters are used, and the detectors are trained for 200,000 iterations.

In an embodiment, an average precision is determined for the different categories, and the results are summarized across categories using the mean average precision (mAP). Average precision measures the area under the precision-recall curve for each object category. In an example, a 0.5 pascal IOU ratio is used as the threshold for a true positive. The baseline generic detector is trained on the dataset to compute the cross-error matrix C.

TABLE 1
Total number of bounding box annotations for each category

Category            Num. Annotations
Shoes               78835
Jeans               12562
Boots               17503
Tanks/Camis         3532
Rompers/Overalls    827
Tunics              1863
Scarves/Wraps       2429
Coats/Jackets       9169
Handbags            16706
Sweater             8006
Dresses             40489
Pants               6239
Clutches            4289
Shorts              3392
Leggings            1272
Sandals             8293
Tees                3528
Beanie/Knit Cap     513
Tote                434
Belts               9910
Cowboy Hats         2315
Blouse/Shirt        17606
Glasses             15859
Suitings/Blazer     564
Skirts              8239
Jumpsuits           1211

FIG. 24 shows an illustrative example of a normalized error matrix, in an embodiment. FIG. 24 illustrates a cross-classification matrix with false positive errors between various categories. From FIG. 24, it is clear that visually similar categories like ‘shoes’ and ‘boots’ are frequently misclassified with each other. Example Algorithm 2 is used to estimate the tree T based on the detector error matrix C. The algorithm finds 7 groups containing more than one element. Details of all the groups thus generated and their names are given in Table 2.

Table 3 shows the mAP comparison between the generic and the proposed hierarchical detector. Since the generic detector, in this example, did not generate any of the newly generated groups, AP results for the new categories are generated by averaging the performance across their children. This is reasonable since the detection of ‘Dresses’ or ‘Jumpsuits’ also indicates the presence of the ‘Full Body’ clothing category. The results show that the hierarchical detector improves the mAP by approximately 4% over the generic detector, at least in this context with the data that was used.

TABLE 2
New root-level categories and their children

Composite          Original Category
Footwear           Shoes, Boots, Sandals
Full Body          Dresses, Jumpsuits, Rompers/Overalls
Top Innerwear      Blouses/Shirts, Tees, Tanks/Camis, Tunics, Sweater
Top Outerwear      Coats/Jackets, Suitings/Blazers
Bags               Handbags, Clutches, Tote
Lower Body         Jeans, Pants, Leggings
Headgear           Cowboy Hat, Beanie/Knit Cap

On the original classes, the mAP of both the generic and hierarchical detectors is the same, indicating no degradation of the underlying network despite the increased number of categories. Notably, the improvement in the performance of the hierarchical detector comes from its ability to capture visual information at a higher level.

TABLE 3
Average precision comparison between the generic and hierarchical detectors

Category            Generic    Hierarchical
Shoes               0.8857     0.8814
Jeans               0.8974     0.892
Boots               0.7736     0.7679
Tanks/Camis         0.4763     0.4721
Rompers/Overalls    0.3733     0.4125
Tunics              0.2095     0.1987
Scarves/Wraps       0.3815     0.3309
Coats/Jackets       0.7918     0.8068
Handbags            0.7906     0.7995
Sweater             0.672      0.6613
Dresses             0.9702     0.9698
Pants               0.598      0.5876
Clutches            0.6407     0.6384
Shorts              0.8287     0.8293
Leggings            0.1705     0.1636
Sandals             0.6223     0.6167
Tees                0.4856     0.4797
Beanie/Knit cap     0.7104     0.67
Tote                0.1708     0.2009
Belts               0.2265     0.2054
Cowboy hat          0.9151     0.9197
Blouses/Shirts      0.6776     0.6693
Glasses             0.7498     0.7414
Suitings/Blazers    0.1369     0.1536
Skirts              0.6797     0.6663
Jumpsuits           0.6734     0.6998
Footwear            0.7605     0.8870
Headgear            0.8127     0.8705
Top Innerwear       0.5042     0.7525
Top Outerwear       0.4643     0.7215
Full Body           0.6723     0.9294
Lower Body          0.5553     0.9153
Bags                0.534      0.7288
mAP                 0.6003     0.6440

FIG. 25 illustrates an example of a hierarchical detector that can correct for missing detections from a generic detector, in an embodiment. A hierarchical detector can correct for missing detections from the generic detector for ambiguous examples. For example, it is hard to clearly identify the type of ‘top innerwear’ occluded by a ‘coat’ or ‘jacket’. However, the hierarchical detector can still detect that the clothing item hidden underneath is an instance of ‘top innerwear’ because of the hierarchical information representation. FIG. 25 shows some examples of ambiguous instances that are identified by the hierarchical detector. Furthermore, the hierarchical detector encourages competition between siblings because, instead of separating one category from all the other categories, the hierarchical detector only separates amongst sibling categories.

FIG. 26 shows an illustrative example of how a hierarchical detector suppresses sibling output in contrast to a generic detector, in an embodiment. In an embodiment, a generic detector predicts two different bounding boxes for two sibling categories, which are suppressed by the hierarchical detector. The hierarchical nature of the detection output allows information to be represented at various scales. For example, the ‘Top Innerwear’ category captures the commonalities between all of its children categories. This aspect of the framework is used to perform detection on a novel category that the detector has never seen during training. For each novel category, a root-level category is assigned, and the maximum confidence detection is computed for all the children and the root-level category. A small test set was collected on which the generic detector fails because these are novel categories. The results on this set are demonstrated in Table 4.

TABLE 4
Detection performance on novel categories

Category    Root Category    Total Images    True Positive    False Positive
Polos       Top Innerwear    165             157              8
Hoodies     Top Innerwear    239             215              14
Briefcase   Bags             132             132              0

Techniques described and suggested herein provide a novel framework for predicting hierarchical categories for a detector. The hierarchy between categories, in various embodiments, is based only on visual similarity. An example implementation of the hierarchical detector demonstrates the ability to capture information at various scales and generalizes the detector to novel categories that it has not been trained on.

FIG. 27 shows an illustrative example of a graphical user interface that can be used in connection with various embodiments discussed herein. The graphical user interface can be provided in various ways, such as in a web page accessible through a web browser, in an application on a mobile or other device, or in other ways. On the left of FIG. 27 is an example of an image that has been uploaded or otherwise made accessible to a server of a computer system (which may be a single device or a distributed computer system comprising multiple devices). The techniques described above may be used to detect clothing objects in the image. In this example, as illustrated by boxes surrounding each detected object, seven objects are detected (a pair of sunglasses, a tank top, a blouse, a handbag, a left shoe, a right shoe, and a pair of shorts). Further, in this example, due to the visual similarity between tank tops and blouses, both options are given on the right side of the interface to provide users greater choice and more results, although in some embodiments, one or the other may be selected and respective results may be provided without results associated with the unselected category.

The graphical user interface may be used, for instance, as part of a service that enables users to upload or otherwise specify images (e.g., via URL) to be analyzed to detect which clothing objects appear in an image, to select a clothing object detected in the image, and to perform a search for similar objects. In an illustrative example, a selected clothing object may be used to determine search terms for a search query that may be performed against one or more databases (e.g., via an interface to a Web service platform). As an example, detection of a pair of shorts may result in a search query including the term “shorts.” Other information about the image may be used to determine terms and other parameters for a search query. For instance, the image may be analyzed to determine whether the shorts are primarily designed for women or men. This may be performed by analyzing the shorts themselves using techniques described above and/or by detecting the presence of a woman's face associated with the shorts detected in the image. Color and other attributes of the detected object may also be used to determine parameters for the search query.

In some examples, parameters for the search query are automatically generated and provided for modification in the graphical user interface of FIG. 27. In the above example, the user interface may indicate that it detected white shorts, and the user may be able to deselect a “white” parameter to indicate that color should not be used to limit search results and/or to select a different color to be used to filter search results.

In the example graphical user interface of FIG. 27, each detected object is provided with a row of search results for a query submitted for that object. In this manner, a user can select which object(s) are of interest and view applicable search results. As noted, in some implementations, a user is able to select which objects are of interest to the user and, as a result, which search results appear in the user interface. For instance, in an embodiment, if a user were presented with the user interface shown in FIG. 27, he or she could select the shorts object in the image (e.g., with a mouse click or touchscreen touch), and search results for a query generated based at least in part on the detected shorts would appear in the user interface (perhaps replacing results for other objects appearing in the user interface).

While FIG. 27 is used for the purpose of illustration, numerous variations are considered as being within the scope of the present disclosure. Further, while clothing and categories of clothing were used for the purpose of illustration, one with ordinary skill in the art would recognize the applicability of the techniques described herein to other contexts where items can be categorized hierarchically. Examples of such contexts include any domain where one of the sensing modalities is vision and the output has a hierarchical semantic organization. Some examples are detecting food items from images, detecting specific types of animal breeds from images (breeds of dogs will share information, as will breeds of cats), and identifying plant species from images of a leaf.

FIG. 28 is another example of the graphical user interface of FIG. 27, where each visible item of clothing is uniquely determined (i.e., without any article of clothing being identified as two categories). The interface of FIG. 28 may operate similarly to that of FIG. 27.

Visual search, or the process of matching products in various images, is challenging because of scale, lighting, camera pose, blur, occlusion, and other visual distractors. Some examples use a two-step matching process: first detecting the high-level category (for example, dresses) in the image and subsequently matching the detector output to the images within the same high-level category. The two-step matching process avoids the need to classify an image into the large number of possible products (various types of dresses), helps identify the region of interest in the image, and reduces the search space for the matching process. In an embodiment, a computer system utilizes a novel deep neural network for image-to-image matching/retrieval after the high-level category detection. This network pools features from various early layers of a deep neural network, enabling the network to focus on and represent the subtle differences between different products of the same high-level category. A framework constructed utilizing techniques described herein, in an embodiment, handles the domain differences by proposing a triplet learning framework which adapts the learning process to account for domain variations. The proposed framework doubles the retrieval accuracy on a large open source dataset such as DeepFashion, while using significantly fewer annotations per image.

As an illustrative example of one embodiment, imagine walking down the streets of New York and noticing an interesting outfit that you would like to buy. The method described herein allows one simply to take a picture and then buy the exact or a similar clothing item from an online service. This problem may be referred to as street-to-shop image matching. Street-to-shop image matching is different from general image retrieval, where both the query image and the images from the database have similar image characteristics. In the street-to-shop matching problem, street images (images from a realistic image source) have complicated backgrounds with multiple products, varying lighting conditions, motion blur, and other imaging noise. In contrast, shop images, which constitute a typical online retailer's catalogue, usually display a single product with a clean background and perfect lighting conditions.

FIG. 29 shows an illustrative example of a triplet with overlaid bounding boxes, in an embodiment. In an embodiment, a computer system utilizes a framework that specifically addresses the domain difference between street and shop images for exact product retrieval. The framework, in some implementations, uses triplets of three images in the form (street image, actual shop product, different shop product). Based on these triplets, the weights of a deep neural network are learned through a machine learning process to encourage the street image to be closer to the actual shop product and far away from a different shop product. FIG. 29 shows a typical example of a triplet, which is used to train a single network to bring a similar street product and shop product closer to each other and separate different products from each other.

Such techniques described herein provide technical advantages over other techniques, such as those that ignore the street-to-shop domain difference or process each domain with a separate network. For example, ignoring the domain difference does not model the problem at hand, and using a separate network for the street and shop domains can double the number of parameters in the overall framework and hence requires significantly more data to train. In an embodiment, a computer system uses a single network for both domains by forming triplets, choosing a street image as the anchor and using shop images as the positive and negative images within a triplet learning framework.

The underlying network for a triplets-based learning framework, in an embodiment, involves computation of features that can represent fine-grained differences in various clothing products. In an embodiment, the computer system uses convolutional neural networks (“CNNs”) as the underlying learning/functional representation ƒ(θ) (omitting the parameters θ for easier representation later) because CNNs have resulted in state-of-the-art performance for a variety of computer vision tasks. CNNs also represent increasingly abstract information from an image along the depth of the network. To address the exact/similar clothing retrieval problem, subtle differences, such as the different collars of two almost identical dresses, are addressed. To address the complexity of exact clothing retrieval, systems in accordance with the present disclosure use a novel network, MarkableNet, which combines information from multiple scales.

The novel network, in an embodiment, summarizes information from multiple convolutional layers of a single model (MarkableNet) for the fashion retrieval problem, and the model achieves state-of-the-art retrieval performance on various public fashion datasets.

The techniques of the present disclosure use a novel way of handling domain differences by designing triplets in a manner that avoids learning a different network for each domain. In an embodiment, MarkableNet combines information from different layers of a network. Such training may be performed online. In some embodiments, techniques of the present disclosure avoid region proposal pooling by pre-extracting only a relevant (clothing) region of interest in an image using the fashion detector described above, which may be implemented as a computer system programmed to perform operations such as those described herein.

Metric learning aims to learn an embedding space in which similar objects are closer together and dissimilar objects are far away from each other. In the context of retrieval problems, this specifically refers to ranking-based metric learning, which often uses the triplet form. A triplet refers to (x, x⁺, x⁻) in which the anchor object x is more similar to the positive object x⁺ than to the negative object x⁻. Metric learning aims to learn a mapping ƒ such that ƒ(x, x⁺) > ƒ(x, x⁻). Different approaches have been proposed to achieve this:

In an embodiment, contrastive loss (pairwise contrastive loss) is defined on pairs of samples. It encourages small distances between intra-class pairs and requires inter-class pair distances to be larger than a margin. However, contrastive loss only focuses on absolute distances, whereas for the ranking problem, relative distance is more important than absolute distance. A way of calculating distance includes, but is not limited to, Euclidean distance (e.g., the sum of squared differences between individual components of a vector), where a deep neural network can be used to transform an image to a vector and then the vector is used as a representation of the input image.

Triplet Loss:

Defined on triplets of samples, triplet loss tries to pull the anchor sample and the positive sample closer while pushing apart the anchor sample and the negative sample, such that the difference between the anchor-positive distance and the anchor-negative distance is larger than a margin.
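
A minimal Python sketch of this hinge-based triplet loss, using squared Euclidean distance between embedding vectors and an illustrative margin, follows.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss on the gap between anchor-positive and anchor-negative
    squared Euclidean distances."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap + margin - d_an)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([1.0, 1.0])
print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin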

In practice, to achieve the best performance, triplet training requires having enough valid triplets in each batch so the network can keep learning. In some embodiments, instead of pre-computing valid triplets, generating triplets for each batch in an online manner reduces memory consumption, thus enabling more triplets in each batch and leading to better model performance.

Beyond the single triplet: due to the huge sampling space of triplets, the convergence rate of triplet training is usually slow. In various examples, many variations of loss functions may be used to incorporate information beyond a single triplet. Some examples use Lifted Structure Embedding, where each anchor-positive pair is compared with all the negative samples in the batch, weighted by the margin violation. A cluster loss function is defined that encourages a margin between the ground truth clustering assignment and an optimized clustering assignment based on computed embeddings of each batch. In some examples, N-Pair Loss enforces Softmax cross-entropy loss among the pairwise distances in the batch. An extension of N-Pair loss to multiple positives is NCA loss. Apart from exploiting information inside each batch, both Magnet Loss and metric learning using Proxies try to utilize global information about the whole embedding space during the training of each batch.

TABLE 5
Fashion dataset comparison. cross: (images of) products that have both street images and shop images; all: total products and images; human: whether the bounding box annotation is human annotated. Markable^(ii) contains human annotated bounding boxes from Markable^(i) and bounding boxes detected by Markable's internal fashion detector.

                # of products      # of images       Bounding box annotation    # of
Dataset         cross    all       cross    all      street   shop   human      categories
Where2BuyIt     10.3 k   204 k     64.5 k   425 k    ✓        x      ✓          11
DeepFashion     33.9 k   —         —        239 k    ✓        ✓      x          23
Markable^(i)    19.5 k   22.3 k    72.4 k   76.8 k   ✓        ✓      ✓          35
Markable^(ii)   25.1 k   25.4 k    307 k    308 k    ✓        ✓      ✓, x       35

All these methods, which can be used in combination with the methods described herein in various embodiments, share the spirit of exploiting global information, as it provides a consistent way to shape the embedding space compared to learning through single triplets. Empirically, the global information methods were found to yield better models than training with single triplets.

In some examples, fashion recognition is implemented. Compared to recognition of landmarks and rigid objects, recognition in the fashion domain is a challenging problem because of the deformable nature of most fashion items. Previous work has explored a variety of computer vision problems, ranging from attribute prediction and landmark prediction to clothing retrieval. In this work, the focus is on the problem of cross-domain image retrieval from street to shop. Note that the techniques described herein and variations thereof are applicable to other domains, such as other domains where the objects can have a deformable nature. Examples include, but are not limited to, domains/contexts where image-to-image matching tasks are used and where the image of an object presents enough information about the object being depicted. Examples include matching images of cars (e.g., by comparing an image of a car to an online car retailer's images), matching images of street signs against stock photos of street signs, matching an image of a house against other images of the same house, matching furniture/indoor decor items against an online retailer's catalogue, and others.

In general, solely using semantic features from the last layers does not result in the best retrieval performance. Mid-level features as well as low-level features also play an important role in the retrieval process, especially in fashion retrieval, where differences between fashion items are subtle. The system achieves this feature combination by learning a single network that summarizes semantic information from various layers.

As shown in Table 5, Where2BuyIt and DeepFashion are open source datasets on fashion recognition. Where2BuyIt contains approximately 204 k products; however, only 10 k products have street images, and none of the shop images have bounding box annotations. DeepFashion contains 34 k products that have images from both domains. However, its image resolution is low compared to Where2BuyIt, and the bounding box annotations are inaccurate.

Markable^(i) and Markable^(ii) are Markable's internal datasets. Such datasets may be obtained in various ways in accordance with various embodiments. For example, in some embodiments, a web scraper is programmed to scrape websites to obtain images and metadata about such images. As another example, a dataset may be obtained by generating images with a digital camera and with human entry of metadata regarding the images using a computing device. Generally, any way of obtaining input data is considered as being within the scope of the present disclosure. With rigorous data cleaning and a human annotation pipeline, a computer system chooses images with high resolution (e.g., resolution above a threshold and/or resolution relatively higher than other images) and ensures accurate bounding boxes and pair information. In Markable^(i), most products have 2 street images and 2 shop images, and all images, in an embodiment, have human annotated bounding boxes, while most products in Markable^(ii) have many more street images and 2-5 shop images, and bounding boxes on these extra images are detected using Markable's fashion detector. Overall, compared to Where2BuyIt and DeepFashion, the Markable datasets are well curated and hence suitable for the training and testing of cross-domain fashion retrieval.

A computer system employing learning techniques described herein uses a single network for cross-domain retrieval that is trained end to end. To design a network best suited for retrieval, the following are taken into account: i) feature representation across layers, ii) feature weighting from a layer, and iii) combining features from multiple layers.

TABLE 6
Top-20 recall on DeepFashion dataset for different feature representations

Feature representation                              R@20 (%)   training set   testing query   testing gallery
L2(PCA₃₀₀(fc7)) ∥ H_(color)                         3.4        —              ▪♦●             ▪♦●
L2(PCA₃₀₀₀(pool5 ∥ fc7))                            7.79       —              ▪♦●             ▪♦●
MP(pool5)                                           3.08       —              ▪♦●             ▪♦●
L2(MP(pool5))                                       5.40       —              ▪♦●             ▪♦●
L2(SP(pool5))                                       7.56       —              ▪♦●             ▪♦●
L2(SP(conv5))                                       7.57       —              ▪♦●             ▪♦●
L2(SP(conv4) ∥ SP(conv5))                           7.70       —              ▪♦●             ▪♦●
L2(L2(SP(conv4)) ∥ L2(SP(conv5)))                   9.25       —              ▪♦●             ▪♦●
L2(L2(SP(conv4)) ∥ L2(SP(conv5)))                   11.44      —              ♦●              ♦●
L2(L2(SP(conv3)) ∥ L2(SP(conv4)) ∥ L2(SP(conv5)))   9.28       —              ▪♦●             ▪♦●

Models                                              R@20 (%)   training set   testing query   testing gallery
VggEmb-tri                                          14.2       ▪              ♦●              ♦●
VggEmb-tri pre-trained@Markable^(i)                 17.8       ▪              ♦●              ♦●
MarkableNet-tri                                     26.7       ▪              ♦●              ♦●
MarkableNet-tri pre-trained@Markable^(i)            33.6       ▪              ♦●              ♦●
FashionNet                                          18.8       ?              ♦●              ♦●

L2: l2 normalization of features; PCA_(d): PCA dimensionality reduction to dimension d; SP/MP: SUM/MAX pooling for each feature map; ∥: concatenation of features; H_(color): color histogram; conv4/conv5: conv4_3/conv5_3 features, etc.; ▪♦●: training/validation/testing splits of the DeepFashion dataset. The FashionNet model is trained on ▪ and tested on ♦●.

FIGS. 30 and 31 show an illustrative example of a network design that captures both coarse-grained and fine-grained representations of fashion items in an image, in an embodiment. In an embodiment, the computer system uses a pre-trained VGG-16 neural network to test the street-to-shop clothing retrieval problem on the DeepFashion dataset. From the measurements (Table 6), insights on how to form a good feature representation for the retrieval problem can be drawn: (i) mid-level features from conv layers are superior to semantic features from fc layers; (ii) L2 normalization of features before concatenation helps; (iii) sum pooling performs better than max pooling; (iv) the improvement from concatenation of lower-level features (before the conv4 layer) is trivial. With these insights, MarkableNet, as shown in FIGS. 30 and 31, is obtained. This network design, in an embodiment, explicitly captures both coarse-grained and fine-grained representations of fashion items in an image, resulting in a significant performance boost on the street-to-shop retrieval problem.

MarkableNet is based on the VGG-16 structure, but, in an embodiment, all fully connected layers after the conv5_3 layer are removed. Sum pooling is applied on each feature map of conv4_3 and conv5_3 before the original 2×2 pooling operation, which gives two 512-d features. Empirically, it is seen that, in some implementations, direct concatenation of the above features results in instability while training the network. In an embodiment, this is addressed by adding L2 normalization before concatenation. However, that appears to simply avoid the issue via rescaling without a significant increase in performance. Instead of L2 normalization, an embodiment uses a batch normalization layer before concatenation to solve the feature scale problem. Following the concatenated 1024-d feature, two fully connected layers are added so the network can have enough capacity to handle the different feature scales and variations coming from the different layers. Batch normalization is also applied after both fully connected layers, and a drop-out layer is not used. The embeddings from MarkableNet are 128-d features, which have a significantly lower memory footprint than most other retrieval methods.
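
The following is a hedged PyTorch re-expression of the design just described (the experiments above report using CAFFE; the torchvision layer indices and the 512-unit hidden size here are assumptions for illustration). Sum-pooled conv4_3 and conv5_3 features are batch normalized, concatenated, and passed through two fully connected layers to produce a 128-d embedding.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class MarkableNetSketch(nn.Module):
    def __init__(self, embed_dim=128, hidden_dim=512):  # hidden_dim assumed
        super().__init__()
        features = vgg16(weights=None).features
        self.to_conv4_3 = features[:23]    # indices 0-22: through conv4_3 + ReLU
        self.to_conv5_3 = features[23:30]  # indices 23-29: pool4 through conv5_3 + ReLU
        self.bn4 = nn.BatchNorm1d(512)     # scale matching before concatenation
        self.bn5 = nn.BatchNorm1d(512)
        self.fc = nn.Sequential(           # two FC layers, batch norm, no dropout
            nn.Linear(1024, hidden_dim), nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, embed_dim),
            nn.BatchNorm1d(embed_dim),
        )

    def forward(self, images):
        c4 = self.to_conv4_3(images)        # (N, 512, H, W), before pool4
        c5 = self.to_conv5_3(c4)            # (N, 512, H/2, W/2), before pool5
        s4 = self.bn4(c4.sum(dim=(2, 3)))   # SUM pool each conv4_3 feature map
        s5 = self.bn5(c5.sum(dim=(2, 3)))   # SUM pool each conv5_3 feature map
        return self.fc(torch.cat([s4, s5], dim=1))

embeddings = MarkableNetSketch()(torch.randn(4, 3, 224, 224))
print(embeddings.shape)  # torch.Size([4, 128])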

FIG. 32 shows an illustrative example of how batches are formed to generate triplets online, in an embodiment. In FIG. 32, n_(a) and n_(p) are the number of street images and the number of shop images per product. For each batch, k products are selected, and for each product, n_(a) street images and n_(p) shop images are randomly selected by a computer system performing the techniques described herein. To learn a model for retrieval from the street domain to the shop domain, cross-domain triplets are used, where anchors are from the street domain, and positives and negatives are from the shop domain. More specifically, for each product, one of its n_(a) street images is selected as the anchor, one of its n_(p) shop images is selected as the positive, and one shop image of another product in the batch is selected as the negative. In this way, it is possible to generate a large number of triplets while only forwarding each unique image once through the network. The triplet loss is defined as:

$\mathcal{L}(X, y) = \frac{1}{|\tau|}\sum_{(x_i, x_i^{+}, x_i^{-}) \in \tau}\left[D^2(x_i, x_i^{+}) + \alpha - D^2(x_i, x_i^{-})\right]_{+}$  (1)

where τ is the set of cross-domain triplets, x_(i) is from the street domain, x_(i)⁺ and x_(i)⁻ are from the shop domain, D is the distance, α is the margin, and [⋅]₊ is the hinge loss. The L2 distance is used as the metric in experiments, in an embodiment, although other suitable metrics can be used.
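
A minimal Python sketch of the cross-domain batch scheme of FIG. 32 follows; products and images are represented by illustrative string identifiers, and every (street anchor, own shop positive, other product's shop negative) combination in the batch is emitted as a triplet.

from itertools import product as cartesian

def online_triplets(batch):
    """batch: {product_id: {'street': [...], 'shop': [...]}} -> list of
    (anchor, positive, negative) triplets, all shop negatives cross-product."""
    triplets = []
    for pid, imgs in batch.items():
        negatives = [s for other, o in batch.items() if other != pid
                     for s in o['shop']]
        for anchor, positive in cartesian(imgs['street'], imgs['shop']):
            for negative in negatives:
                triplets.append((anchor, positive, negative))
    return triplets

batch = {
    'dress_a': {'street': ['a_st1', 'a_st2'], 'shop': ['a_sh1', 'a_sh2']},
    'dress_b': {'street': ['b_st1', 'b_st2'], 'shop': ['b_sh1', 'b_sh2']},
}
# 2 products x (2 street x 2 shop) pairs x 2 negatives each = 16 triplets
print(len(online_triplets(batch)))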

Other than triplet loss, other loss functions are defined. For example, in an embodiment, n-pair tuples are constructed by selecting one shop image per product and using N-Pair loss. Since N-Pair loss is a special case of NCA loss, in an embodiment, NCA loss is used, where NCA loss is defined as:

$\mathcal{L}(X, y) = -\frac{1}{N}\sum_{i}\log\frac{\sum_{j \in C_i} e^{-D(x_i, x_j)}}{\sum_{j \in C} e^{-D(x_i, x_j)}}$  (2)

where x_(i) is from the street domain, all shop images of its corresponding product compose C_(i), and C is the set of shop images from all the products within the batch.
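
A minimal Python sketch of Equation 2, using squared Euclidean distance and illustrative embeddings and product identifiers, follows.

import numpy as np

def nca_loss(street, street_labels, shop, shop_labels):
    """street: (N, d) query embeddings; shop: (M, d) gallery embeddings;
    labels: product ids. Implements Eq. 2 with squared Euclidean D."""
    shop_labels = np.asarray(shop_labels)
    losses = []
    for x, label in zip(street, street_labels):
        dists = np.sum((shop - x) ** 2, axis=1)   # D(x_i, x_j) for all shop j
        weights = np.exp(-dists)
        same = weights[shop_labels == label].sum()  # numerator: j in C_i
        losses.append(-np.log(same / weights.sum()))
    return float(np.mean(losses))

street = np.array([[0.0, 0.0]])
shop = np.array([[0.1, 0.0], [2.0, 2.0]])
print(nca_loss(street, [7], shop, [7, 8]))  # small: own shop image is nearest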

To improve performance, previous works have proposed using negative mining. For the negative mining method, semi-hard negative mining and random hard negative mining were evaluated. Such mining methods do not work as well as using all valid triplets in terms of training stability. Instead, in the late stage of training, in an embodiment, a hard negative products mining step, which aims at forcing the network to learn fine-grained subtleties, is used. Hard negative products mining can be used for any metric. The mining steps are illustrated in FIG. 33.

For each query street image x_(i), a set S_(i) is formed, which contains approximately Δ−1 products similar to the query product. More specifically, each query yields a response containing distances to all the shop images in the database. To form the set S_(i), the distances are ranked in increasing order. Considering the ranked shop images, if the position of the first exact product is greater than the mining window size, in an embodiment, then the mining window fully resides on the left side (e.g., x_(i) in FIG. 33). In the case of the first exact product position being less than Δ, the mining window extends to the right side (e.g., x_(i+1) in FIG. 33) in order to find a total of Δ−1 similar shop images. S_(i) is then composed of the unique products within the mining window. In the case of duplicate products within the mining window, S_(i) will contain fewer than Δ products.

To form batches, the query image and the shop images of its mining window's products are used as preset images. Images of products in S_(i) are randomly sampled so that each product has n_(a) street images and n_(p) shop images in the batch. In some implementations, it will also be necessary to append randomly sampled products due to the fixed batch size. FIG. 33 showcases the hard negative products mining steps with n_(a)=2, n_(p)=2, Δ=4, and a batch size of 32.
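
A minimal Python sketch of the mining-window step follows; the ranking, product identifiers, and window handling reflect one reading of the description above and are illustrative rather than definitive.

def mine_hard_negative_products(ranked_product_ids, query_product, delta):
    """ranked_product_ids: product ids of shop images sorted by increasing
    distance to the query. Returns the mined similar-product set S_i."""
    first_hit = ranked_product_ids.index(query_product)
    if first_hit >= delta:
        window = ranked_product_ids[:delta]   # window fully on the left
    else:
        # Extend rightwards until delta - 1 similar (non-exact) images found.
        window, similar = [], 0
        for pid in ranked_product_ids:
            window.append(pid)
            similar += (pid != query_product)
            if similar == delta - 1:
                break
    return {pid for pid in window if pid != query_product}

ranked = ['p3', 'p9', 'p9', 'p1', 'p5', 'p7']  # 'p1' is the exact product
# Duplicates in the window leave fewer than delta - 1 unique products.
print(mine_hard_negative_products(ranked, 'p1', delta=4))  # {'p3', 'p9'}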

In an embodiment, a batch size of 144 is used. The system was tested using different values of n_(a) and n_(p) (see FIG. 32). Empirical results demonstrate that the training converges slowly and is less stable when using large values for n_(a) and n_(p). In practice, n_(a)=2 and n_(p)=2 is generally a good setting, as it strikes a good balance between forcing the network to learn inter-product variations and, at the same time, handling intra-product variations. For hard negative products mining, a group size of Δ=6 was used (see FIG. 33).

In an embodiment, experiments with different network structures and loss functions on various street-to-shop datasets demonstrate the effectiveness of the network and the cross-domain batch scheme. The following nomenclature is used:

-   VggEmb: this model has a 128-d embedding layer after the fc7 layer of the VGG-16 model.
-   MarkableNet: the Markable CNN model (FIGS. 30-31).
-   tri: trained with triplet loss (Eq. (1)).
-   nca: trained with NCA loss (Eq. (2)).
-   hnm: hard negative products mining.
-   M₅: the model with the best retrieval performance (Table 7).

For the training of both VggEmb and MarkableNet, gradients are back-propagated through to the conv4_1 layer. The margin is set to 0.3 for the triplet loss. The top-k recall metric is used to measure performance, wherein a true positive is the case when the exact product is within the first k retrieved results.
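
A minimal Python sketch of the top-k recall metric, with illustrative retrieval lists, follows.

def top_k_recall(retrieved_lists, true_products, k=20):
    """Fraction of queries whose exact product appears in the first k
    retrieved results."""
    hits = sum(truth in retrieved[:k]
               for retrieved, truth in zip(retrieved_lists, true_products))
    return hits / len(true_products)

retrievals = [['p2', 'p1', 'p4'], ['p5', 'p6', 'p7']]
print(top_k_recall(retrievals, ['p1', 'p9'], k=2))  # 0.5: one hit, one miss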

FIG. 33 shows an illustrative example of the hard negative products mining steps, in an embodiment. Circles are street images, while stars are shop images. Colors are used to differentiate different products. x_(i) is the ith query street image, Δ is the mining window size, and S_(i) is the set of mined similar products for x_(i). In order to form batches, the system also: i) samples images (non-filled circles and stars) of the corresponding products in S_(i); and ii) appends randomly sampled products in the case of duplicate products in the mining window, so that batch images are organized as in FIG. 33.

In an embodiment, fine-tuning on the datasets is accomplished using 80% of the products for training and 20% of the products for testing. As seen from Table 7, all embeddings from the MarkableNet structure (M₂-M₅) achieve much higher recall than embeddings from the VggEmb structure (M₁). Training on the larger dataset Markable^(ii) also boosts retrieval performance compared to training on Markable^(i). Furthermore, hard negative products mining always helps in increasing the recall, and the improvement is more significant on a bigger dataset. All these improvements from better feature representation, a bigger dataset, and negative products mining are more obvious when considering challenging cases such as the “Accessories” categories.

To evaluate the performance of the system, in an embodiment, MarkableNet has been tested on DeepFashion, Where2BuyIt, and/or other public datasets. On the DeepFashion dataset, as shown in Table 6, MarkableNet attains approximately a 40% relative increase in top-20 recall compared to the existing system performance of 18.8%. Thus, the techniques described herein comprise technological improvements for extraction of the relevant features for street-to-shop matching. Further evaluation of the contribution of a clean dataset can be made by using M₂ (see Table 7) as the pre-trained model. After fine-tuning, the model achieves approximately a 78% relative improvement over other solutions. Top-20 retrieval recall on the Where2BuyIt dataset is given in Table 8. For both cases of training with or without using Markable's internal datasets, MarkableNet is able to achieve the highest recall for most categories.

In an embodiment, the t-SNE algorithm is used for dimensionality reduction, and the Jonker-Volgenant algorithm is used for grid alignment, to visualize the embedding vectors on a subset of Markable^(i). Dress shop images may be grouped based on factors such as color, texture, and style. Similar patterns may be observed for products from other categories as well. In some examples, model M₅ is able to handle most of the variations from the street domain and clusters street and shop images per product. For example, for the dresses category, intra-product distances and inter-product distances are well separated. Thus, the learning process pulls intra-product embeddings together and pushes inter-product embeddings apart. Overall, these visualizations demonstrate that feature representations using the embeddings from MarkableNet are suitable for fashion retrieval.
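A minimal visualization sketch, assuming scikit-learn and SciPy are available (scipy.optimize.linear_sum_assignment implements a modified Jonker-Volgenant solver), projects the embeddings with t-SNE and then snaps each point to a unique grid cell:

    import numpy as np
    from sklearn.manifold import TSNE
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def embed_to_grid(features, grid_side):
        """Project embeddings to 2-D with t-SNE, then assign each point to a
        unique cell of a grid_side x grid_side grid by solving a linear
        assignment problem. Assumes len(features) >= grid_side ** 2."""
        pts = TSNE(n_components=2).fit_transform(features[: grid_side ** 2])
        pts = (pts - pts.min(0)) / (pts.max(0) - pts.min(0))   # to unit square
        xv, yv = np.meshgrid(np.linspace(0, 1, grid_side), np.linspace(0, 1, grid_side))
        grid = np.stack([xv.ravel(), yv.ravel()], axis=1)
        cost = cdist(pts, grid, metric="sqeuclidean")
        _, cols = linear_sum_assignment(cost)   # point i -> grid cell cols[i]
        return grid[cols]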

In production, given a query image, in an embodiment, the Markable internal fashion detector is used to detect and classify all of the fashion items in the query image; a within-category retrieval is then performed for all of the detected items using their bounding boxes and categories from the detector. For both top-10 hit and miss cases, most retrieved products are similar to the query items in one or more aspects. The results also show some failure cases arising from large pose deformation, occlusion due to long hair, and variable amounts of skin captured in a bounding box.

TABLE 7
Top-k recall on Markable datasets for different experiments. M₁: VggEmb-tri; M₂: MarkableNet-tri; M₃: MarkableNet-hnm-tri; M₄: MarkableNet-nca; M₅: MarkableNet-hnm-nca. M₁, M₂, and M₃ are trained on the Markable^(i) dataset; M₄ and M₅ are trained on the Markable^(ii) dataset. “All categories” includes the total of 35 subcategories, “Clothing” includes 17 subcategories, and “Accessories” includes 18 subcategories.

Markable^(i)     R@20 (%)                      R@10 (%)                      R@2 (%)
                 M₁    M₂    M₃    M₄    M₅    M₁    M₂    M₃    M₄    M₅    M₁    M₂    M₃    M₄    M₅
All categories   64.3  79.2  79.0  88.0  87.7  54.0  70.6  71.4  81.8  82.1  32.7  49.7  51.0  61.6  64.0
Clothing         77.0  89.3  88.9  94.3  94.6  66.9  83.5  83.8  90.3  91.2  44.0  65.3  66.6  74.5  77.5
Accessories      45.9  64.3  64.3  77.7  76.7  34.2  51.3  52.7  67.8  67.2  15.8  26.1  27.6  40.0  41.6

Markable^(ii)    R@20 (%)          R@10 (%)          R@2 (%)
                 M₂    M₄    M₅    M₂    M₄    M₅    M₂    M₄    M₅
All categories   62.8  74.3  78.0  56.8  68.0  72.9  41.0  50.7  58.1

TABLE 8
Top-20 recall on the Where2BuyIt dataset. VisNet and M₅ are trained with external data.

w/o external data         bags  belts dresses eyewear footwear hats  leggings outerwear pants skirts tops  Overall
F.T. Similarity           37.4  13.5  37.1    35.5    9.6      38.4  22.1     21.0      29.2  54.6   38.1  28.97
R. Contrastive & Softmax  46.6  20.2  56.9    13.8    13.1     24.4  15.9     20.3      22.3  50.8   48.0  37.24
MarkableNet-nca           36.7  33.3  58.5    56.9    33.1     33.8  18.5     27.5      44.0  74.1   42.9  41.8

w/ external data          bags  belts dresses eyewear footwear hats  leggings outerwear pants skirts tops  Overall
VisNet                    —     —     61.1    —       —        —     32.4     43.1      31.8  71.8   62.6  —
M₅                        55.4  19.0  84.5    72.4    62.2     41.5  15.4     60.9      63.6  87.3   58.6  67.4

In this work, the possibilities of constructing a good feature representation for the problem of fashion retrieval are explored. MarkableNet, which uses summarization features pooled from multiple convolutional layers of the VGG-16 model, is a novel solution to this problem. Two datasets are constructed as training material for MarkableNet. Results from extensive experiments show that MarkableNet's improved performance comes from both better feature descriptors and bigger, higher-quality datasets. No substantial differences in model performance are found to result from the choice of loss function in metric learning; however, convergence is much faster when using the NCA loss. Hard negative products mining can be used as a reliable tool to further improve model performance.

Variations considered as being within the scope of the present disclosure include those using better feature representations from better models such as ResNet and feature pyramid networks. Different methods for region-of-interest pooling and instance-level segmentation may also play a role on the way to achieving human-level fashion recognition performance.

In an embodiment, deep learning is applicable to many problems such as image classification, object detection, and segmentation. These developments are employed to build intelligent and powerful consumer-facing products that enhance the user experience. One of the applications of improved visual understanding is visual search. Visual search is not limited to cases where both the query and the database consist of image data; for example, a video may be used to query against a database of images. The systems and methods described herein are able to detect products present in images and videos. In some implementations, individual products are identified using a database of product images and videos.

In some examples, the system allows sellers to upload videos or images of clothing products into an electronic catalog. An electronic catalog may be a database, data store, array, or other data structure stored on computer-readable media that is accessible to the system. Sellers may upload videos or images into the catalog over a computer network, on physical media, or by way of a camera or video capture device that is connected to the system. In one implementation, sellers upload images using a client computer system running a web browser, and the system provides a Web server that accepts uploaded images.

Consumers are able to search against that catalog by providing a free-form image or video with a query request. The query request may be uploaded from a client computer system via a web browser, or using client software running on the client device. In one implementation, the client software is an application running on a mobile device, tablet computer system, cell phone, or other appliance that includes a camera. The consumer captures an image on the client device and, using the client software, uploads the image to the service. In one implementation, the client device is a cell phone, and the client captures the image on the cell phone and uploads it to the service over a cellular network.

FIG. 34 shows an illustrative example of image and video product retrieval, in an embodiment. In some embodiments, fashion recognition techniques and applications are based on recognition from a single image. For example, given an input image, the system recognizes the fashion items in the image and identifies similar fashion items that are available from online retailers, as shown in FIG. 34. As more consumers have access to video capture devices, recognition of a product based at least in part on video samples is becoming more important. In some implementations, the success of image-based fashion recognition relies on the quality of the representations learned by neural networks.

An image-based retrieval system contains a detector to detect fashion items in the query image and an extractor to extract a feature representation in an embedding space for each detected item. Using a specific distance metric for the embedding space, the feature representation of each item is used to retrieve matching and similar products whose features are close to the query feature in the embedding space. The detectors and feature extractors are tolerant to variations such as pose and lighting variations and mild occlusions that are present in images. However, many real-world video samples pose a challenge to the system, due to the larger image variations present in the video domain. As a result, applying conventional image-based retrieval processes to video frames may fail, and the present system provides a retrieval system that is tolerant to the image-quality variations often present in real-world video samples. In general, image-based retrieval techniques, when applied to videos, tend to generate false positives and low-quality bounding box predictions, both of which pollute the inputs to the extractor and produce bad feature representations for final retrieval.
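At its core, the within-embedding-space retrieval step can be sketched as a nearest-neighbor search (the names and the one-feature-per-product layout are simplifying assumptions):

    import numpy as np

    def retrieve(query_feature, catalog_features, catalog_ids, k=10):
        """Rank catalog products by Euclidean distance to the query embedding
        and return the k closest product ids. catalog_features is an (N, d)
        array of shop-image embeddings, one row per product."""
        dists = np.linalg.norm(catalog_features - query_feature, axis=1)
        return [catalog_ids[i] for i in np.argsort(dists)[:k]]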

In various embodiments described herein, the video product retrieval system trains a neural network that is able to detect, track, and extract discriminative feature representation information for each item in a user video. There are several challenges to this approach. First, it may be difficult to collect a large amount of training data pairing user-uploaded videos of persons wearing a certain product with the product's online images from a retailer's website. As an alternative, it may be easier to collect user-uploaded images of persons wearing a certain product. Second, it can be difficult to extract product features for the database, and to train the model, if there are relatively few (2-4) images of the product from retailers. In some implementations, the video product retrieval system integrates an image-based detection plus feature-extraction pipeline to enable both image and video product retrieval.

A video may contain multiple image frames showing the same product. The video product retrieval system takes advantage of this by fusing the product's representations from a plurality of frames into a single high-quality representation. The speed of the downstream retrieval process is increased due to a more compact representation of the product, and the retrieval results are improved because the fused representation is more comprehensive than a representation derived from a single image. In some embodiments, there will be fewer features in the database to search against, resulting in a faster retrieval process. Individual images within a video stream may vary in terms of quality. For example, in a particular video stream, some detections may have poor quality and are thus not suitable to pass to the extractor for feature fusion. Therefore, in some examples, the video product retrieval system filters the available images to select only good detections to be used for fusion. In some video frames, multiple items may be present, and an association mechanism is used to form tracklets of each item across video frames before feature fusion. The video product retrieval system: i) detects, tracks, and generates a feature fusion that results in improved video product retrieval results; ii) integrates into image-based retrieval systems; and iii) is able to integrate further improvements of video-based models such as tracking models. In an embodiment, a tracklet is a descriptor that captures a shape and/or joint motion within a video segment by identifying spatio-temporal interest areas within a sequence of individual video frames. In some examples, a tracklet describes the (potentially moving) location of an object or object portion within a sequence of video frames. In some examples, a tracklet includes a movement vector for the region that describes the direction and speed of the object within the frame.

In various embodiments, the processing of video-based queries can be approached using a variety of techniques. In one example, video frames are treated as sequential data or as an image set. If they are treated as sequential data, then a recurrent neural network may be used to model the temporal dependencies among video frames. However, at inference time, the output may not be permutation invariant with respect to input frames. If the video frames are treated as an image set, the prediction can be deterministic. Since database products are in the form of an image set, a single image-set-based model can be applied on both the query and search domains. Metric learning may be used to learn the mapping from one domain to a different domain. Tasks such as face verification, person re-identification, and product retrieval may use metric learning, while classification generally does not. Tracklets of each face/person/product may be used. When tracklets are used, cases where inputs are polluted by false positives are excluded. Either a tracking model or an association mechanism may be used to form tracklets.

TABLE 9
Summary of Techniques

Task                       Set vs sequential   Tracklets available   Metric learning         Domain mapping       Training data
Image set classification   set                 —                     no                      —                    abundant
Video face recognition     sequential          yes                   yes, for verification   —                    abundant
                           set                 yes                   yes, for verification   —                    abundant
Person re-identification   sequential          yes                   yes                     video to video       abundant
                           set                 yes                   yes                     video to video       abundant
Video product retrieval    sequential          yes                   yes                     video to image set   scarce
                           set                 no                    yes                     video to image set   scarce

Although different techniques may differ along the above dimensions, in general, many techniques combine multiple features to produce a single and more comprehensive feature. The fusion can be in the form of straightforward temporal or set pooling. Among the pooling options, average pooling may be superior to maximum or minimum pooling along a temporal dimension, in many instances. Advanced methods of fusion rely on a temporal attention mechanism. A soft attention mechanism gives a fusion weight to each feature of each video frame; the fusion weight may be in the form of a quality score that signifies the image quality of the current frame. Some implementations use a hard attention mechanism to pick out the subset of good frames for fusion, which is modeled as a Markov Decision Process (“MDP”) that uses reinforcement learning. In some examples, fusion is performed at the feature level, but fusion can also happen at the score or metric level. Some examples learn a similarity network using a tree-like structure to measure the distance between a set of query features and a database feature. However, these metric-level fusion methods may have lower performance and may be more computationally intensive when compared to feature-level fusion.

The video product retrieval system takes the following considerations into account: i) the retrieval result should be permutation invariant with respect to input video frames, so video frames are treated as an image set; ii) quality-aware feature fusion is performed using a quality awareness module; and iii) tracklets are formed using an association algorithm.

FIG. 35 shows an illustrative example of a video product retrieval system that identifies one or more products from a video or image, in an embodiment. In an embodiment, the video product retrieval system is implemented as a computer system containing memory and one or more processors. The memory stores executable instructions that, when executed by the one or more processors, cause the computer system to perform operations that implement the system. In various embodiments, the executable instructions may be described by grouping particular portions of the executable instructions into functional components, modules, or interfaces. Such groupings may be made for a variety of purposes including improving the readability, understanding, and maintainability of the executable instructions. In some examples, executable instructions may be grouped and arranged in ways that improve the performance of the computer system as a whole. In the present document, performance of a particular operation may be described as being performed by a particular module or component. Those of ordinary skill in the art are aware of this practice and understand that, in general, the operation is performed by the one or more processors of the system as a result of executing instructions that are associated with the particular module or component. In an embodiment, the executable instructions associated with the system include detection, extraction, association, and fusion modules.

In an embodiment, the detection modules and extraction modules are image-based models. The extraction model serves as a feature extractor, and may also serve as an input item image quality checker. The extraction module is able to determine, for the patch inside each bounding box predicted by the detector, how good the feature representation of that patch is for the retrieval task. If the bounding box is not regressed well, then the quality is determined to be low. In some examples, the level of regression may be determined by a threshold value set by an administrator and stored in a memory of the system. If the bounding box is accurate, but the patch content is not suitable for retrieval (for example, due to an occlusion or motion blur), then the quality will also be low. A quality score threshold is used to remove obviously bad detections before they are fed into the association module to form tracklets. However, in some examples, quality thresholding may not be able to filter out false positives from detections, as some of the detected false positive items can have high patch quality. Therefore, in such situations, false positives are removed in the association module. In addition, the selected patches of each tracklet and the corresponding quality scores are passed to the fusion module to get a fused feature for the item that corresponds to the tracklet. Quality scores may be used as weights to fuse the tracklet features. The fused features are then used to query the database for retrieval. Since product images are usually high-quality images captured in controlled environments with clean backgrounds, the fusion module in the product domain can be an average fusion technique.

FIG. 36 shows an illustrative example of quality head branch training, in an embodiment. In an embodiment, the video product retrieval system generates quality scores by training a quality prediction head branch using the mid-level convolutional features from the extraction model. For training data, the video product retrieval system adopts data augmentation approaches and labels each augmented box with a quality score based on certain empirical metrics, such as the intersection-over-union ratio with respect to the ground truth, and the variance of the Laplacian as an estimate of its blurriness. The video product retrieval system may train the quality head as a regressor from convolutional features to quality scores, or may train the extractor end to end. In the latter case, the quality scores may be used to fuse the final features, and the metric learned from the fused features may be used to learn the quality scores implicitly.
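The two empirical metrics mentioned above can be combined into a training label roughly as follows; the combination rule and the sharpness normalization constant are illustrative assumptions, not values taken from the disclosure:

    import cv2
    import numpy as np

    def iou(box_a, box_b):
        """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area(box_a) + area(box_b) - inter)

    def blurriness(patch_bgr):
        """Variance of the Laplacian; lower values indicate a blurrier patch."""
        gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()

    def quality_label(aug_box, gt_box, patch_bgr, sharp_ref=100.0):
        """Combine localization quality (IoU against the ground truth) with
        sharpness into a [0, 1] quality score for the augmented box."""
        sharpness = min(blurriness(patch_bgr) / sharp_ref, 1.0)
        return iou(aug_box, gt_box) * sharpness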

In an embodiment, an association module determines which items belong to the same product amongst the set of detected items across video frames. Let $I_c^t = (I_{c,0}^t, I_{c,1}^t, \ldots, I_{c,n}^t)$ be the set of $n$ detected items of class $c$ in frame $t$, and let $f_c^t = (f_{c,0}^t, f_{c,1}^t, \ldots, f_{c,n}^t)$ be the corresponding feature representations of these $n$ detected items. A length-$l_k$ tracklet $T_k = (I_{c,i_0}^{t_k^0}, I_{c,i_1}^{t_k^1}, \ldots, I_{c,i_{l_k}}^{t_k^{l_k}})$ is a collection of detected items across different video frames that are recognized as the same product. Each tracklet has a running averaged feature that represents the corresponding tracked product. The video product retrieval system uses a method based on the distances between $f_c^t$ and the tracklets' features under a chosen distance metric (e.g., Euclidean distance) to associate the clothing items at time $t$ with the tracklets available at time $t-\tau$. An example is shown in the method below. Using this method, the video product retrieval system is able to track items across a plurality of video frames.

for each video frame do
  for each tracklet T_(k) do
    increase tracklet T_(k)'s idle length by 1;
  end
  for each f_(c,i)^(t) ∈ f_(c)^(t) do
    compare f_(c,i)^(t) with the fused features of all tracklets and get the L2 distance to the closest tracklet T_(k);
    if the distance ≤ thresh_(d) then
      attach item I_(c,i)^(t) to tracklet T_(k);
      update fused feature f_(Tk) = (f_(Tk) + f_(c,i)^(t)) / 2;
      increase tracklet T_(k)'s length by 1;
      set tracklet T_(k)'s idle length to 0;
      if tracklet T_(k)'s length > thresh_(active) then
        tracklet T_(k) is activated;
        send f_(Tk) for product retrieval;
      end
    else
      create a new tracklet T_(z);
      attach item I_(c,i)^(t) to the new tracklet T_(z);
      set tracklet T_(z)'s fused feature f_(Tz) = f_(c,i)^(t);
      set tracklet T_(z)'s length to 1;
      set tracklet T_(z)'s idle length to 0;
    end
  end
  for each tracklet T_(k) do
    if its idle length > thresh_(idle) then
      delete tracklet T_(k);
    end
  end
end
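A runnable sketch of one association step, with illustrative threshold values rather than tuned settings from the disclosure, is:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Tracklet:
        fused: np.ndarray      # running averaged feature of the tracked product
        length: int = 1
        idle: int = 0
        active: bool = False

    def associate(tracklets, frame_feats, thresh_d=0.8, thresh_active=3, thresh_idle=10):
        """Associate one frame's detected-item features with tracklets."""
        for t in tracklets:
            t.idle += 1
        for f in frame_feats:
            if tracklets:
                dists = [np.linalg.norm(f - t.fused) for t in tracklets]
                k = int(np.argmin(dists))
                if dists[k] <= thresh_d:
                    t = tracklets[k]
                    t.fused = (t.fused + f) / 2     # running average update
                    t.length += 1
                    t.idle = 0
                    if t.length > thresh_active:
                        t.active = True             # fused feature ready for retrieval
                    continue
            tracklets.append(Tracklet(fused=f))     # start a new tracklet
        # drop tracklets that have been idle too long
        return [t for t in tracklets if t.idle <= thresh_idle]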

Once a fashion item has been tracked across video frames, the video product retrieval system fuses the features for that particular item. This can be achieved in various ways, one of which is to calculate a weighted average using quality scores. Let $f_{i,c} = (f_{i,c}^0, f_{i,c}^1, \ldots, f_{i,c}^p)$ and $q_{i,c} = (q_{i,c}^0, q_{i,c}^1, \ldots, q_{i,c}^p)$ be the set of features and quality scores for clothing item $i$ of class $c$. The fused feature for that item may be calculated as:

$f_{i,c}^{p} = {\frac{1}{p}{\sum\limits_{m = 1}^{p}{q_{i,c}^{m}*f_{i,c}^{m}}}}$

Note that the fusion here is different from the running average used in the association process, although in principle both processes could use the same fusion module. In some implementations, combining the fusion module with the association module may ease the association difficulties by placing additional weight on recent features.
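The quality-weighted fusion above reduces to a few lines of NumPy (a sketch, assuming one feature and one quality score per selected frame):

    import numpy as np

    def fuse_features(feats, quality):
        """Quality-weighted fusion per the equation above: (1/p) * sum_m q_m * f_m.
        feats: (p, d) array of per-frame features; quality: (p,) scores."""
        feats = np.asarray(feats, dtype=np.float64)
        quality = np.asarray(quality, dtype=np.float64)
        return (quality[:, None] * feats).sum(axis=0) / len(feats)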

Video-based processing modules can be integrated into the video product retrieval system framework. Tracking can be integrated within a detection module to ease the burdens on, or even replace, the association module. An attention mask can also be generated with quality scores, and the attention mask may, in some embodiments, be used to aid retrieval. If video frames are treated as sequential data, a recurrent unit in the fusion module can accept features within a tracklet sequentially, thereby adjusting the fusion module's knowledge about the tracked product and producing the fused feature, with quality awareness embedded in its intermediate hidden states.

The video product retrieval system can be used to improve existing image-based product retrieval systems for end-to-end video product retrieval. In various examples, the video product retrieval system achieves this by removing image-based detection's false positives through quality score filtering and association. In addition, in some examples, quality-aware feature fusion provides a comprehensive representation for product retrieval and improves the scalability of the system.

In an embodiment, a computer system analyzes the characteristics and attributes of clothing using image and video information. In various examples, the system is able to achieve human-level understanding. The system uses object detection to achieve localization and categorization of an object such as a dress. In some examples, the system performs an analysis that goes beyond mere categorization to generate an enhanced profile of each piece of clothing. For example, for a dress, the system may determine the color, pattern, material, type of sleeve, type of collar, and other attributes. The system goes beyond identifying attributes of a particular image by associating particular attributes with particular subjects within an image. By doing so, the system is able to localize the object present in the image and represent the information specific to a product contained in the image. In various implementations, the system provides an end-to-end detector and attribute network that localizes and categorizes the products present, as well as finds specific mid-level attributes corresponding to each product.

FIG. 37 shows an illustrative example of a product web page that includes product attributes, in an embodiment. In many examples, designers and retailers add attributes describing the items being sold. For example, in FIG. 37, the retailer includes a description of attributes that may be helpful to a potential buyer (e.g., upper hemline, material, color, etc.). This process may be performed manually. In various embodiments described herein, a computer vision system automates the task of determining product attributes using images or video frames of the products in question.

In some embodiments, the computer vision system uses deep-learning-based systems, which provide improved performance on computer vision tasks. In various examples, a subset of deep network architectures performs quite well on object detection tasks. Such architectures are built to identify object instances within images and/or video frames.

FIG. 38 shows an illustrative example of output from a detection and attribute network, in an embodiment. The described computer vision system uses these architectures to identify clothing objects within images and video content, and uses these detections to provide users with a list of clothing/apparel attributes. This can be accomplished by building a deep learning architecture composed of two primary modules: 1) a module for detecting fashion items, and 2) a module for generating a list of product attributes for each detected item. An example of the output of this system is shown in FIG. 38, in which all clothing items are detected and their respective attributes listed.

Retailers are able to use the computer vision system for various applications, including but not limited to:

-   -   Visual SEO—Automatically enriching the information for each
        clothing item in an online retailer's inventory.
    -   Better Categorization/Taxonomies—Parsing the entire inventory of
        an online retailer and categorizing along the lines of color,
        pattern, and material, as well as things like dress category.
    -   Attribute-Based Search—Searching an online retailer's inventory
        using mid-level attributes that are automatically populated.
    -   Fashion Trend Analysis—Utilizing attributes to analyze fashion
        trends on clients' sites and social media platforms. These
        insights can then be used to improve sales through better
        consumer understanding.

The present document describes a computer vision system that provides an end-to-end system capable of localizing, detecting, and extracting products from image and video content and producing attributes affiliated with those items. In some embodiments, the computer vision system integrates an attribute-extraction system with visual search to improve the relevancy of search results with product queries extracted from image or video content.

Deep-learning-based object detection methods may be divided into several categories. One approach is a two-step method, where the input image is first passed through an object proposal network and then passed through a classification head. Another approach is a one-step method, where bounding boxes are directly located and predicted in one step.

In various embodiments, the computer vision system described herein uses a two-stage method. The computer vision system may use a region proposal network to locate candidate bounding boxes and use them for classification. In some implementations, clothing items have unique attributes related to them. For example, T-shirts may have unique attributes like sleeve length, hemline, and closure type, whereas shoes may have attributes such as heel type, heel length, and toe type. In an embodiment, the attributes network detects high-level clothing categories prior to predicting the attributes. In another embodiment, the computer vision system divides the network into two parts: a high-level clothing category detector and an attributes classifier.

In the present document, a computer vision system is described. In an embodiment, the computer vision system is implemented as a computer system containing memory and one or more processors. The memory stores executable instructions that, when executed by the one or more processors, cause the computer system to perform operations that implement the system. In various embodiments, the executable instructions may be described by grouping particular portions of the executable instructions into functional components, modules, or interfaces. Such groupings may be made for a variety of purposes including improving the readability, understanding, and maintainability of the executable instructions. In some examples, executable instructions may be grouped and arranged in ways that improve the performance of the computer system as a whole. In the present document, performance of a particular operation may be described as being performed by a particular module or component. Those of ordinary skill in the art are aware of this practice and understand that, in general, the operation is performed by the system as a result of executing instructions that are associated with the particular module or component. In an embodiment, the executable instructions associated with the system include detector and attribute components.

In some implementations, deep-learning neural networks may be computationally expensive, thereby making it impractical to use a sliding-window approach to localize objects and predict their categories in some situations. To address this problem, certain embodiments of the computer vision system use a region proposal network to output candidate bounding boxes where an object is likely to be present. Convolutional neural networks are used to extract discriminative features from these candidate bounding boxes. These extracted features are then fed into a classifier for category classification.
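For illustration only, a two-stage, region-proposal-based detector of this kind can be instantiated with torchvision's Faster R-CNN (the class count is a placeholder, and the API shown is torchvision 0.13+; this is not asserted to be the disclosed detector):

    import torch
    import torchvision

    NUM_CLOTHING_CLASSES = 18  # placeholder count, including background

    # Region proposal network + classification head in one model.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights=None, num_classes=NUM_CLOTHING_CLASSES
    )
    model.eval()

    image = torch.rand(3, 800, 600)        # stand-in for a real query image
    with torch.no_grad():
        detections = model([image])[0]     # dict with boxes, labels, scores
    print(detections["boxes"].shape, detections["labels"][:5])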

In an embodiment, the detector provides the attributes network with prior information regarding 1) high-level clothing categories and 2) locations in the input image. The attributes net further extracts convolutional features within the final bounding boxes provided by the detector to predict attributes on top of the high-level clothing category. For example, if the detector predicts a dress, the attributes network predicts attributes related to dresses (dress type, sleeve length, upper hemline, and so forth).
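One way to realize category-conditioned attribute prediction, sketched with hypothetical category and attribute vocabularies, is to give each high-level category its own set of attribute heads:

    import torch
    import torch.nn as nn

    # Hypothetical vocabularies: each category owns category-specific attributes.
    ATTRS = {"dress": ["dress_type", "sleeve_length", "upper_hemline"],
             "shoe":  ["heel_type", "heel_length", "toe_type"]}

    class AttributeHeads(nn.Module):
        def __init__(self, feat_dim=1024, num_values=16):
            super().__init__()
            self.heads = nn.ModuleDict({
                cat: nn.ModuleList([nn.Linear(feat_dim, num_values) for _ in names])
                for cat, names in ATTRS.items()
            })

        def forward(self, roi_feature, category):
            """Predict each attribute of the detected category from RoI features."""
            names = ATTRS[category]
            return {name: head(roi_feature)
                    for name, head in zip(names, self.heads[category])}

    heads = AttributeHeads()
    logits = heads(torch.randn(1024), "dress")   # logits per dress attribute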

In an embodiment, the detection and attribute functionality described above may be implemented as two separate convolutional neural network (“CNN”) architectures. These CNNs 1) localize and categorize clothing items and 2) predict attributes. In some implementations, using separate networks for training and inference may be cumbersome. Therefore, in some implementations, the computer vision system combines the detection and attributes networks, producing a single network that can be trained in an end-to-end fashion.

The initial network layers may be task-agnostic and extract low-level features like edges, shapes, and so on. The attributes network can share the features of the initial layers in the detection network and can utilize these low-level features to compute task-specific high-level features for attribute detection. By sharing computations, this end-to-end architecture alleviates most of the computational burden associated with implementations that utilize two separate convolutional networks.

FIG. 39 shows an illustrative example of a schematic of a detection and attribute network, in an embodiment. An image is processed through a convolutional neural network to extract a feature volume. The feature volume is then passed through a region proposal network that defines one or more regions of interest. The regions are further passed to a classification head and a bounding-box regression head that predict the category of the clothing item encapsulated by the box and the final bounding-box coordinates. An attribute network is also attached on top of the feature maps extracted by the detector. The attribute network takes as input the same regions as the classification and regression heads but yields attributes of the clothing items.

  End-to-end training method
  Input: image-label pairs {X_(i), Y_(i)}, Y_(i) = (p_(j)^(*), b_(j)^(*), a*_(j)^(k))
  for image i in the batch do
    Extract conv features: X_(conv) = f(X_(i))
    Get regions of interest using the proposal algorithm: B_(roi) = P_(N)(X_(conv))
    Object classification and bbox regression on each RoI: p_(j), b_(j) = g(X_(conv), B_(roi)^(j))
    Predict K attributes on each detected box: a_(j)^(k) = h(X_(conv), B_(roi)^(j))
    Compute losses:
    $L_{({{cls},{reg},{attr}})} = {\sum\limits_{j = 1}^{N}{L_{entropy}\left( {p_{j},p_{j}^{*}} \right)}} + {\sum\limits_{j = 1}^{N}{L_{reg}\left( {b_{j},b_{j}^{*}} \right)}} + {\sum\limits_{j = 1}^{N}{\sum\limits_{k = 1}^{K}{\lambda_{k}\, L_{entropy}\left( {a_{j}^{k},{a*_{j}^{k}}} \right)}}}$
    Back-propagate and update weights
  end
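A hedged PyTorch rendering of the combined loss, with the frequency weights applied through cross_entropy's weight argument (shapes and the smooth-L1 regression choice are assumptions), is:

    import torch
    import torch.nn.functional as F

    def detection_attr_loss(cls_logits, cls_targets,
                            box_preds, box_targets,
                            attr_logits, attr_targets, attr_freq):
        """L_(cls,reg,attr): cross entropy for categories, smooth L1 for box
        regression, and frequency-weighted cross entropy per attribute.
        attr_logits: list of K (N, V_k) tensors; attr_targets: list of K (N,)
        tensors; attr_freq: list of K (V_k,) count tensors (the N_k values)."""
        loss = F.cross_entropy(cls_logits, cls_targets)
        loss = loss + F.smooth_l1_loss(box_preds, box_targets)
        for logits, targets, freq in zip(attr_logits, attr_targets, attr_freq):
            weight = 1.0 / freq.float().clamp(min=1)   # lambda_k = 1 / N_k
            loss = loss + F.cross_entropy(logits, targets, weight=weight)
        return loss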

Class imbalance is an issue that may arise in machine learning. Class imbalance results from a class distribution that is highly skewed toward a few classes. For example, in fraud detection, very few transactions are classified as fraudulent. Training a classifier with such class imbalance can cause the classifier to be biased toward the dominant class (for example, non-fraudulent samples in the case of fraud detection).

Fashion datasets face a similar issue with attribute classes. Commonly worn styles dominate over more exotic ones. For example, upper-wear clothing styles like crew necks and classic collars may be much more abundant in fashion datasets than non-traditional collar types; solid pattern types may be more abundant than polka dots. Training an attribute detector naively on such datasets may produce a biased classifier. To solve this problem, the computer vision system assigns a weight to the attribute loss given by

$\lambda_{k} = \frac{1}{N_{k}}$

where N_(k) is the frequency of the k^(th) attribute in the training data. Thus, less prevalent attributes will be given more weight than higher-frequency attributes. This weighting procedure modulates the gradients accordingly, resulting in an unbiased classifier.

The method above illustrates an example of an end-to-end training method used by the computer vision system, where p_(j)* is the ground truth class probability of the j^(th) RoI, b_(j)* is the set of bbox regression target coordinates of the j^(th) RoI, and a*_(j)^(k) is the k^(th) attribute of the j^(th) RoI. λ_(k) is the weight assigned to the loss for the k^(th) attribute.

FIG. 40 illustrates an environment in which various embodiments can be implemented. FIG. 40 is an illustrative, simplified block diagram of an example computing device 4000 that may be used to practice at least one embodiment of the present disclosure. In various embodiments, the computing device 4000 may be used to implement any of the systems illustrated herein and described above. For example, the computing device 4000 may be configured for use as a data server, a web server, a portable computing device, a personal computer, or any electronic computing device. As shown in FIG. 40, the computing device 4000 may include one or more processors 4002 that may be configured to communicate with, and are operatively coupled to, a number of peripheral subsystems via a bus subsystem 4004. The processors 4002 may be utilized for the traversal of decision trees in a random forest of supervised models in embodiments of the present disclosure (e.g., to cause the evaluation of inverse document frequencies of various search terms, etc.). These peripheral subsystems may include a storage subsystem 4006, comprising a memory subsystem 4008 and a file storage subsystem 4010, one or more user interface input devices 4012, one or more user interface output devices 4014, and a network interface subsystem 4016. The storage subsystem 4006 may be used for temporary or long-term storage of information such as details associated with transactions described in the present disclosure, databases of historical records described in the present disclosure, and storage of decision rules of the supervised models in the present disclosure.

The bus subsystem 4004 may provide a mechanism for enabling the various components and subsystems of a computing device 4000 to communicate with each other as intended. Although the bus subsystem 4004 is shown schematically as a single bus, alternative embodiments of the bus subsystem utilize multiple busses. The network interface subsystem 4016 may provide an interface to other computing devices and networks. The network interface subsystem 4016 may serve as an interface for receiving data from, and transmitting data to, other systems from the computing device 4000. For example, the network interface subsystem 4016 may enable a data technician to connect the device to a wireless network such that the data technician may be able to transmit and receive data while in a remote location, such as a user data center. The bus subsystem 4004 may be utilized for communicating data, such as details, search terms, and so on, to the supervised model of the present disclosure, and may be utilized for communicating the output of the supervised model to the one or more processors 4002 and to merchants and/or creditors via the network interface subsystem 4016.

The user interface input devices 4012 may include one or more user input devices, such as a keyboard, pointing devices such as an integrated mouse, trackball, touchpad, or graphics tablet, a scanner, a barcode scanner, a touch screen incorporated into the display, audio input devices such as voice recognition systems, microphones, and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and mechanisms for inputting information to the computing device 4000. The one or more user interface output devices 4014 may include a display subsystem, a printer, or non-visual displays such as audio output devices, etc. The display subsystem may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a light emitting diode (LED) display, or a projection or other display device. In general, use of the term “output device” is intended to include all possible types of devices and mechanisms for outputting information from the computing device 4000. The one or more output devices 4014 may be used, for example, to present user interfaces to facilitate user interaction with applications performing processes described herein and variations therein, where such interaction may be appropriate.

The storage subsystem 4006 may provide a computer-readable storage medium for storing the basic programming and data constructs that may provide the functionality of at least one embodiment of the present disclosure. The applications (programs, code modules, instructions) that, as a result of being executed by one or more processors, may provide the functionality of one or more embodiments of the present disclosure, may be stored in the storage subsystem 4006. These application modules or instructions may be executed by the one or more processors 4002. The storage subsystem 4006 may additionally provide a repository for storing data used in accordance with the present disclosure. The storage subsystem 4006 may comprise a memory subsystem 4008 and a file/disk storage subsystem 4010.

The memory subsystem 4008 may include a number of memories, including a main random access memory (RAM) 4018 for storage of instructions and data during program execution and a read-only memory (ROM) 4020 in which fixed instructions may be stored. The file storage subsystem 4010 may provide a non-transitory persistent (non-volatile) storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a Compact Disk Read-Only Memory (CD-ROM) drive, an optical drive, removable media cartridges, and other like storage media.

The computing device 4000 may include at least one local clock 4024. The local clock 4024 may be a counter that represents the number of ticks that have transpired from a particular starting date and may be located integrally within the computing device 4000. The local clock 4024 may be used to synchronize data transfers in the processors for the computing device 4000 and all of the subsystems included therein at specific clock pulses and may be used to coordinate synchronous operations between the computing device 4000 and other systems in a data center. In one embodiment, the local clock 4024 is an atomic clock. In another embodiment, the local clock is a programmable interval timer.

The computing device 4000 may be of various types, including a portable computer device, a tablet computer, a workstation, or any other device described below. Additionally, the computing device 4000 may include another device that may be connected to the computing device 4000 through one or more ports (e.g., USB, a headphone jack, Lightning connector, etc.). The device that may be connected to the computing device 4000 may include a plurality of ports configured to accept fiber-optic connectors. Accordingly, this device may be configured to convert optical signals to electrical signals that may be transmitted through the port connecting the device to the computing device 4000 for processing. Due to the ever-changing nature of computers and networks, the description of the computing device 4000 depicted in FIG. 40 is intended only as a specific example for purposes of illustrating the preferred embodiment of the device. Many other configurations having more or fewer components than the system depicted in FIG. 40 are possible.

FIG. 41 illustrates aspects of an example environment 4100 for implementing aspects in accordance with various embodiments. A client/server environment is shown for the purposes of explanation, but other environments may be used in other implementations. The environment includes a client computer system 4102. The client computer system can be a desktop computer, laptop computer, computing appliance, or mobile device that is able to send or receive information over a computer network 4104. Other examples of client computer systems include cell phones, tablet computers, wearable devices, personal digital assistants (“PDAs”), embedded control systems, and smart appliances. The computer network 4104 can be a wired or wireless network. Wired networks can include Ethernet (10baseT, 100baseT, or Gigabit), AppleTalk, Token Ring, Fiber Channel, USB, RS-232, or Powerline networks; wireless networks can include 802.11 Wi-Fi, Bluetooth, or infrared-communication-based networks. A variety of communication protocols may be used over the computer network 4104. The communication protocols may include TCP/IP, IPX, or DLC. A variety of intermediate protocols may operate on top of these protocols, such as HTTP, HTTP secure (“HTTPS”), simple network management protocol (“SNMP”), and simple mail transfer protocol (“SMTP”). The computer network 4104 may include a combination of subnetworks including the Internet, internal home networks, or business intranets.

The environment includes a server computer system 4106. The server computer system 4106 receives requests from various computer systems connected to the computer network 4104, including the client computer system 4102. The server computer system 4106 can be a single server computer, a number of server computer systems arranged in a server cluster, or a virtual computer system capable of receiving requests and sending responses over the computer network 4104. In some environments, a personal computer system, handheld device, or cell phone can perform the functions of the server computer system 4106. If more than one addressable device is used to process requests, a load balancer or other coordinating entity such as a firewall may be placed between the client computer system 4102 and a server computer system 4106. The load balancer may receive requests on behalf of a collection of server devices and route requests across the collection of server devices.

The server computer system 4106 may implement a plurality of services by exporting more than one service interface. For example, a number of services may be implemented on the server computer system 4106 as a corresponding number of processes. Each process may be bound to a different network address and/or network port. A particular network client can access a particular service by submitting a request to the corresponding network address and port.

The server computer system 4106 is connected to a data store 4108. The term data store may refer to a device capable of storing and retrieving computer-readable information, such as disk drives, semiconductor RAM, ROM, flash memory, optical disks, CD-ROMs, and EEPROM. In some implementations, write-once/read-many memory such as EEPROM memory may be used to generate a data store. In some implementations, a database may be used to store information. In some examples, a database may be created through the use of a commercial application such as SQL Server, Oracle, Access, or another relational database engine. Tables and keys are defined that allow for rapid and efficient access to information using particular key values. Tables may be linked for quick and efficient access to data. Relational database engines allow operations to be performed on stored data using a standard query language (“SQL”). SQL commands or scripts may be submitted that create, alter, delete, or synthesize information stored within the database. Those skilled in the art will appreciate that, in some systems, some database functions may be integrated into an application. Hash tables, ordered lists, stacks, and queues may be implemented and arranged to perform similar functionality in many applications. The term “data store” refers to any device or combination of devices capable of storing, accessing, and retrieving data, which may include any combination and number of data servers, databases, data storage devices, and data storage media, in any standard, distributed, virtual, or clustered environment. As used herein, the term “database” refers to both commercial database engines and custom implementations of database functionality using ordered and indexed data structures, hash tables, arrays, linked lists, key-value pair structures, and the like.

A server computer system 4106 may provide access and authentication controls that limit access to the information maintained in the data store 4108. An authentication system controls access to the server computer system by verifying the identity of the person or entity submitting a request to the server computer system 4106. Authentication is achieved by validating authentication information such as a username and password, a digital signature, or a biometric value. In some implementations, authentication occurs through the submission of a username and password known only by an authorized user. In another implementation, authentication occurs as a result of the submission of a digital signature using a cryptographic key known to be under the control of the client computer system 4102. The cryptographic key may be a private cryptographic key associated with a digital certificate. Requests submitted to the server computer system 4106 may be subject to authorization controls. Authorization controls may be based at least in part on the identity of the requester or the requesting device. In some implementations, authorization controls may subject service requests to a time-based or data-rate throttling limitation.

Content stored on the data store 4108 and served by the server computer system 4106 may include documents, text, graphics, music or audio, video content, executable content, executable scripts, or binary data for use with a computer application. For example, content served by a Web server may be in HyperText Markup Language (“HTML”), Extensible Markup Language (“XML”), JavaScript, Cascading Style Sheets (“CSS”), JavaScript Object Notation (JSON), and/or another appropriate format. Content may be served from the server computer system 4106 to the client computer system 4102 in plaintext or encrypted form.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. However, it will be evident that various modifications and changes may be made thereunto without departing from the scope of the invention as set forth in the claims. Likewise, other variations are within the scope of the present disclosure. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific form or forms disclosed but, on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the scope of the invention, as defined in the appended claims.

The use of the terms “a” and “an” and “the” and similar referents in the context of describing the disclosed embodiments (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated or clearly contradicted by context. The terms “comprising”, “having”, “including”, and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to”) unless otherwise noted. The term “connected”, when unmodified and referring to physical connections, is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values in the present disclosure is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range unless otherwise indicated, and each separate value is incorporated into the specification as if it were individually recited. The use of the term “set” (e.g., “a set of items”) or “subset”, unless otherwise noted or contradicted by context, is to be construed as a nonempty collection comprising one or more members. Further, unless otherwise noted or contradicted by context, the term “subset” of a corresponding set does not necessarily denote a proper subset of the corresponding set, but the subset and the corresponding set may be equal.

Conjunctive language, such as phrases of the form “at least one of A, B, and C”, or “at least one of A, B and C”, unless specifically stated otherwise or otherwise clearly contradicted by context, is otherwise understood with the context as used in general to present that an item, term, etc., may be either A or B or C, or any nonempty subset of the set of A and B and C. For instance, in the illustrative example of a set having three members, the conjunctive phrases “at least one of A, B, and C” and “at least one of A, B and C” refer to any of the following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of A, at least one of B, and at least one of C each to be present.

Operations of processes described can be performed in any suitable order unless otherwise indicated or otherwise clearly contradicted by context. Processes described (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or by combinations thereof. The code may be stored on a computer-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable storage medium may be non-transitory.

The use of any and all examples, or exemplary language (e.g., “such as”) provided, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.

Embodiments of this disclosure are described, including the best mode known to the inventors for carrying out the invention. Variations of those embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for embodiments of the present disclosure to be practiced otherwise than as specifically described. Accordingly, the scope of the present disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the scope of the present disclosure unless otherwise indicated or otherwise clearly contradicted by context.

All references, including publications, patent applications, and patents, cited are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be capable of designing many alternative embodiments without departing from the scope of the invention, as defined by the appended claims. In the claims, any reference signs placed in parentheses shall not be construed as limiting the claims. The words “comprising”, “comprises”, and the like do not exclude the presence of elements or steps other than those listed in any claim or the specification as a whole. In the present specification, “comprises” means “includes or consists of” and “comprising” means “including or consisting of.” The singular reference of an element does not exclude the plural reference of such elements and vice-versa. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Data encryption may be accomplished using various forms of symmetric and/or asymmetric cryptographic primitives. Symmetric key algorithms may include various schemes for performing cryptographic operations on data, including block ciphers, stream ciphers, and digital signature schemes. Example symmetric key algorithms include the advanced encryption standard (AES), the data encryption standard (DES), triple DES (3DES), Serpent, Twofish, Blowfish, CAST5, RC4, and the international data encryption algorithm (IDEA). Symmetric key algorithms may also include those used to generate the output of one-way functions and include algorithms that utilize hash-based message authentication codes (HMACs), message authentication codes (MACs) in general, PBKDF2, and Bcrypt. Asymmetric key algorithms may also include various schemes for performing cryptographic operations on data. Example algorithms include those that utilize the Diffie-Hellman key exchange protocol, the digital signature standard (DSS), the digital signature algorithm, the ElGamal algorithm, various elliptic curve algorithms, password-authenticated key agreement techniques, the Paillier cryptosystem, the RSA encryption algorithm (PKCS #1), the Cramer-Shoup cryptosystem, the YAK authenticated key agreement protocol, the NTRUEncrypt cryptosystem, the McEliece cryptosystem, and others. Elliptic curve algorithms include the elliptic curve Diffie-Hellman (ECDH) key agreement scheme, the Elliptic Curve Integrated Encryption Scheme (ECIES), the Elliptic Curve Digital Signature Algorithm (ECDSA), the ECMQV key agreement scheme, and the ECQV implicit certificate scheme. Other algorithms and combinations of algorithms are also considered as being within the scope of the present disclosure, and the above is not intended to be an exhaustive list.

Note also that the examples used herein may be performed in compliance with one or more of: Request for Comments (RFC) 4250, RFC 4251, RFC 4252, RFC 4253, RFC 4254, RFC 4255, RFC 4256, RFC 4335, RFC 4344, RFC 4345, RFC 4419, RFC 4432, RFC 4462, RFC 4716, RFC 4819, RFC 5647, RFC 5656, RFC 6187, RFC 6239, RFC 6594, and RFC 6668, which are incorporated by reference.

Generally, embodiments of the present disclosure may use various protocols, such as a SSL or TLS protocol and extensions thereto, such as defined in Request for Comments (RFC) 2246, RFC 2595, RFC 2712, RFC 2817, RFC 2818, RFC 3207, RFC 3268, RFC 3546, RFC 3749, RFC 3943, RFC 4132, RFC 4162, RFC 4217, RFC 4279, RFC 4347, RFC 4366, RFC 4492, RFC 4680, RFC 4681, RFC 4785, RFC 5054, RFC 5077, RFC 5081, RFC 5238, RFC 5246, RFC 5288, RFC 5289, RFC 5746, RFC 5764, RFC 5878, RFC 5932, RFC 6083, RFC 6066, RFC 6091, RFC 6176, RFC 6209, RFC 6347, RFC 6367, RFC 6460, RFC 6655, RFC 7027, and RFC 7366, which are incorporated herein by reference, to establish encrypted communications sessions. Other protocols implemented below the application layer of the Open Systems Interconnect (OSI) model may also be used and/or adapted to utilize techniques described herein. It should be noted that the techniques described herein are adaptable to other protocols such as the Real Time Messaging Protocol (RTMP), the Point-to-Point Tunneling Protocol (PPTP), the Layer 2 Tunneling Protocol, various virtual private network (VPN) protocols, Internet Protocol Security (e.g., as defined in RFC 1825 through 1829, RFC 2401, RFC 2412, RFC 4301, RFC 4309, and RFC 4303), and other protocols, such as protocols for secure communication that include a handshake.

In the preceding and following description, various techniques are described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of possible ways of implementing the techniques. However, it will also be apparent that the techniques described below may be practiced in different configurations without the specific details. Furthermore, well-known features may be omitted or simplified to avoid obscuring the techniques being described.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.

Embodiments of the disclosure can be described in view of the following clauses:

1. A computer-implemented method, comprising:
    -   acquiring an image;
    -   determining a set of regions of interest in the image, the set of regions containing a set of objects;
    -   determining a set of potential categories for each object in the set of objects based on a hierarchical tree of object categories;
    -   identifying, from the set of potential categories for each object in the set of objects, a category for each object in the set of objects;
    -   determining that at least one object in the set of objects matches an item identified by a user;
    -   identifying a set of items that match the set of objects; and
    -   identifying the set of items to the user.

2. The computer-implemented method of clause 1, wherein determining that an item matches an object is accomplished by at least:
    -   determining a set of attributes for the object; and
    -   determining that attributes of the item match the set of attributes.

3. The computer-implemented method of clause 2, wherein:
    -   the set of attributes is determined using a convolutional neural network; and
    -   less prevalent attributes in the set of attributes are given more weight than more prevalent attributes.

4. The computer-implemented method of any of clauses 1-3, wherein the item is an article of clothing, a piece of jewelry, a bag, or a set of eyeglasses.

5. The computer-implemented method of any of clauses 1-4, wherein:
    -   the item is identified by the user by the user providing an image of the item; and
    -   the computer-implemented method further comprises identifying the item from the image.

6. The computer-implemented method of any of clauses 1-5, wherein the set of items is identified by identifying items that have attributes that match the attributes of the set of objects.

7. A computer system, comprising:
    -   a processor; and
    -   memory storing instructions that, when executed by the processor, cause the computer system to:
        -   present a set of images on a display, each image in the set of images showing a representation of a set of objects;
        -   acquire information indicating a selection of a particular image of the set of images;
        -   determine a set of potential categories for each object in the set of objects in the particular image based on a hierarchical tree of object categories;
        -   identify, from the set of potential categories for each object in the set of objects in the particular image, a category;
        -   identify, based at least in part on the category of each object in the set of objects, a set of attributes for each object in the set of objects in the particular image;
        -   identify, based on the set of attributes, one or more items that match at least one object in the set of objects in the particular image; and
        -   present the one or more items on the display.

8. The computer system of clause 7, wherein the instructions further cause the computer system to:
    -   acquire, from a user, an indication that identifies a particular item; and
    -   determine the set of images by identifying look images that include a representation of an article that matches the particular item.

9. The computer system of clause 7 or 8, wherein:
    -   the computer system is a cell phone that includes a camera; and
    -   the set of images includes an image acquired by the computer system using the camera.

10. The computer system of any of clauses 7-9, wherein:
    -   the instructions further cause the computer system to acquire a look record for each image of the set of images; and
    -   each look record describes an associated set of objects for the look record and a set of attributes for each article in the associated set of objects.

11. The computer system of any of clauses 7-10, wherein the instructions further cause the computer system to:
    -   present an image of the set of images on a display; and
    -   in response to a user swiping the display, present a different image of the set of images on the display.

12. The computer system of any of clauses 7-11, wherein the set of attributes includes a color, a texture, and a pattern.

13. The computer system of any of clauses 7-12, wherein the instructions further cause the computer system to:
    -   acquire a video segment that includes image frames;
    -   identify an article across a plurality of the image frames using a tracklet; and
    -   identify attributes of the article using the tracklet.

14. The computer system of any of clauses 7-13, wherein the instructions further cause the computer system to identify an item that matches an article by at least:
    -   determining an item category for the article; and
    -   searching items in the item category for items with attributes matching attributes of the article.

15. The computer system of any of clauses 7-14, wherein the set of images is determined by at least:
    -   acquiring information that identifies a particular person; and
    -   adding, to the set of images, images of the particular person.

16. A non-transitory computer-readable storage medium storing instructions that, as a result of being executed by a processor of a computing system, cause the computing system to:
    -   receive a request that identifies an image;
    -   identify an object represented in the image;
    -   determine a set of potential categories for the object in the image based on a hierarchical tree of object categories;
    -   identify, from the set of potential categories for the object in the image, a category;
    -   identify, based at least in part on the category of the object, a set of characteristics for the object in the image; and
    -   identify one or more similar objects from a database of objects based at least in part on the set of characteristics.

17. The non-transitory computer-readable storage medium of clause 16, wherein the instructions include a script that is downloaded into a memory of a browser running on a client computer system.

18. The non-transitory computer-readable storage medium of clause 16 or 17, wherein the object is identified by at least:
    -   identifying a region of the image containing an article;
    -   determining a category of the article;
    -   determining that the category of the article matches the category of the object; and
    -   determining that a threshold number of attributes of the article match attributes of the object.

19. The non-transitory computer-readable storage medium of any of clauses 16-18, further comprising instructions that, as a result of being executed by the processor of the computing system, cause the computing system to present the one or more similar objects to a user via a display on a web browser.

20. The non-transitory computer-readable storage medium of clause 19, further comprising instructions that, as a result of being executed by the processor of the computing system, cause the computing system to provide a selectable link that enables the user to purchase at least one of the one or more similar objects.
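As a purely illustrative, non-limiting sketch of the attribute matching described in clauses 1-3, the following Python fragment weights less prevalent attributes more heavily via inverse-frequency weighting and scores candidate items by their weighted attribute overlap with a detected object. The helper names (`attribute_weights`, `match_score`) and the sample catalog are hypothetical stand-ins; extraction of the attributes themselves by a convolutional neural network is assumed to happen elsewhere.

```python
# Illustrative sketch of attribute-based matching per clauses 1-3.
# Rarer attributes receive higher weight (inverse-frequency weighting);
# CNN-based attribute extraction is abstracted away.
import math
from collections import Counter
from typing import Dict, Iterable

def attribute_weights(catalog_attributes: Iterable[frozenset]) -> Dict[str, float]:
    """Weight each attribute by log inverse frequency across the catalog."""
    counts = Counter(a for attrs in catalog_attributes for a in attrs)
    total = sum(counts.values())
    return {a: math.log(total / c) for a, c in counts.items()}

def match_score(object_attrs: frozenset, item_attrs: frozenset,
                weights: Dict[str, float]) -> float:
    """Sum the weights of attributes shared by the object and the item."""
    return sum(weights.get(a, 0.0) for a in object_attrs & item_attrs)

# Hypothetical catalog: per-item attribute sets produced elsewhere.
catalog = {
    "item-1": frozenset({"red", "floral", "cotton"}),
    "item-2": frozenset({"red", "solid", "denim"}),
}
weights = attribute_weights(catalog.values())
query = frozenset({"red", "floral"})
best = max(catalog, key=lambda k: match_score(query, catalog[k], weights))
```

Under this weighting, an uncommon attribute such as a rare pattern contributes more to the score than a ubiquitous one such as a common color, which is one way to realize the weighting recited in clause 3.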

Other variations are within the spirit of the present disclosure. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention, as defined in the appended claims.

The use of the terms “a” and “an” and “the” and similar referents in the context of describing the disclosed embodiments (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. The term “connected,” when unmodified and referring to physical connections, is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. The use of the term “set” (e.g., “a set of items”) or “subset,” unless otherwise noted or contradicted by context, is to be construed as a nonempty collection comprising one or more members. Further, unless otherwise noted or contradicted by context, the term “subset” of a corresponding set does not necessarily denote a proper subset of the corresponding set, but the subset and the corresponding set may be equal.

Conjunctive language, such as phrases of the form “at least one of A, B, and C,” or “at least one of A, B and C,” unless specifically stated otherwise or otherwise clearly contradicted by context, is otherwise understood with the context as used in general to present that an item, term, etc., may be either A or B or C, or any nonempty subset of the set of A and B and C. For instance, in the illustrative example of a set having three members, the conjunctive phrases “at least one of A, B, and C” and “at least one of A, B and C” refer to any of the following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of A, at least one of B and at least one of C each to be present.

Operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. Processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs or one or more applications) executing collectively on one or more processors, by hardware or combinations thereof. The code may be stored on a computer-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable storage medium may be non-transitory. In some embodiments, the code is stored on a set of one or more non-transitory computer-readable storage media having stored thereon executable instructions that, when executed (i.e., as a result of being executed) by one or more processors of a computer system, cause the computer system to perform operations described herein. The set of non-transitory computer-readable storage media may comprise multiple non-transitory computer-readable storage media, and one or more of the individual non-transitory storage media of the multiple non-transitory computer-readable storage media may lack all of the code while the multiple non-transitory computer-readable storage media collectively store all of the code. Further, in some examples, the executable instructions are executed such that different instructions are executed by different processors. As an illustrative example, a non-transitory computer-readable storage medium may store instructions; a main CPU may execute some of the instructions and a graphics processing unit may execute others of the instructions. Generally, different components of a computer system may have separate processors, and different processors may execute different subsets of the instructions.

Accordingly, in some examples, computer systems are configured to implement one or more services that singly or collectively perform operations of processes described herein. Such computer systems may, for instance, be configured with applicable hardware and/or software that enable the performance of the operations. Further, computer systems that implement various embodiments of the present disclosure may, in some examples, be single devices and, in other examples, be distributed computer systems comprising multiple devices that operate differently such that the distributed computer system performs the operations described herein and such that a single device may not perform all operations.

The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.

Embodiments of this disclosure are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate and the inventors intend for embodiments of the present disclosure to be practiced otherwise than as specifically described herein. Accordingly, the scope of the present disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the scope of the present disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.

What is claimed is:
1. A computer-implemented method for retrieving item information from an image, the computer-implemented method executable by a hardware processor, the method comprising: receiving a request to query the image; selecting an item in the image to generate a selected item; predicting, by a detector module, a bounding box around the selected item in the image; determining, by an extractor module, whether a quality score associated with the bounding box around the selected item in the image exceeds a quality score threshold; extracting a feature representation of the selected item in the image when the quality score associated with the bounding box around the selected item in the image exceeds the quality score threshold; and identifying a product from a database of products based at least in part on the feature representation of the selected item in the image.

2. The computer-implemented method of claim 1, wherein the image is one of a plurality of images in a video segment, and wherein the method further comprises: detecting a plurality of tracked items associated with the selected item, wherein each tracked item of the plurality of tracked items corresponds to an image of the plurality of images in the video segment; extracting a plurality of feature representations of the selected item, wherein each feature representation of the plurality of feature representations is extracted from a tracked item associated with a different image in the plurality of images in the video segment; and fusing a subset of the plurality of feature representations of the selected item to generate a fused feature representation of the selected item.

3. The computer-implemented method of claim 2, the method further comprising: selecting the subset of the plurality of feature representations of the selected item by determining whether each of the plurality of feature representations of the selected item is suitable.

4. The computer-implemented method of claim 2, wherein the fusing the subset of the plurality of feature representations of the selected item to generate the fused feature representation of the selected item comprises temporal pooling.

5. The computer-implemented method of claim 4, wherein the temporal pooling is average pooling.

6. The computer-implemented method of claim 2, wherein the fusing the subset of the plurality of feature representations of the selected item to generate the fused feature representation of the selected item comprises executing a temporal attention mechanism.

7. The computer-implemented method of claim 6, wherein the temporal attention mechanism is a soft attention mechanism configured to determine a plurality of fusion weights, and wherein each of the plurality of fusion weights corresponds to one of the plurality of feature representations of the selected item.

8. The computer-implemented method of claim 6, wherein the temporal attention mechanism is a hard attention mechanism modeled as a Markov decision process configured to execute reinforcement learning.

9. The computer-implemented method of claim 2, further comprising: forming a plurality of tracklets associated with the selected item, wherein the plurality of tracklets is correlated with spatio-temporal information associated with the selected item.

10. The computer-implemented method of claim 9, wherein the spatio-temporal information is a location of the selected item.

11. The computer-implemented method of claim 9, wherein the spatio-temporal information is a direction and a speed of the selected item.

12. The computer-implemented method of claim 2, wherein the plurality of images in the video segment is sequential data, and wherein a recurrent neural network models a plurality of temporal dependencies among the plurality of images in the video segment.

13. The computer-implemented method of claim 2, wherein the plurality of images in the video segment is an image set, and wherein metric learning is applied to a single image set-based model.

14. The computer-implemented method of claim 2, wherein the quality score is generated by training a quality prediction head branch, and wherein the training data for the quality prediction head branch comprises data augmentation.

15. The computer-implemented method of claim 14, wherein the quality prediction head branch is trained as a regressor from a plurality of convolutional features associated with the selected item.

16. The computer-implemented method of claim 1, wherein the detector module comprises a neural network configured to map an input image to a plurality of bounding boxes and to a plurality of categories, and wherein each of the plurality of bounding boxes corresponds to one of the plurality of categories.

17. The computer-implemented method of claim 1, wherein the extractor module comprises a neural network configured to extract from an input bounding box a plurality of discriminative features.

18. The computer-implemented method of claim 17, wherein the extractor module is configured to predict attributes of the selected item.

19. The computer-implemented method of claim 1, wherein the selected item is an article of clothing, and wherein the method further comprises: generating an enhanced profile of the selected item, wherein the enhanced profile comprises information selected from the group consisting of color, pattern, material, type of sleeve, and type of collar.

20. A non-transitory computer-readable storage medium having program instructions stored therein, for retrieving item information from an image, the program instructions executable by a hardware processor to cause the hardware processor to: receive a request to query the image; select an item in the image to generate a selected item; predict, by a detector module, a bounding box around the selected item in the image; determine, by an extractor module, whether a quality score associated with the bounding box around the selected item in the image exceeds a quality score threshold; extract a feature representation of the selected item in the image when the quality score associated with the bounding box around the selected item in the image exceeds the quality score threshold; and identify a product from a database of products based at least in part on the feature representation of the selected item in the image.
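For orientation only, and not as a characterization of the claimed subject matter, the following Python sketch traces the flow recited in claims 1-7: a detector module proposes a bounding box, feature extraction is gated on a quality score threshold, and per-frame feature representations from a video segment are fused either by temporal average pooling or by a soft attention mechanism whose softmax-normalized weights correspond one-to-one to the per-frame feature representations. All names (`detect`, `quality_score`, `extract_features`, `search_products`) and the threshold value are hypothetical assumptions, not interfaces defined by this disclosure.

```python
# Illustrative sketch of the claimed retrieval flow; all helpers passed
# in are hypothetical stand-ins, not components defined by the claims.
import numpy as np

QUALITY_THRESHOLD = 0.5  # assumed value, for illustration only

def fuse_average(features: np.ndarray) -> np.ndarray:
    """Temporal average pooling over per-frame features (claims 4-5).
    features: (num_frames, feature_dim)."""
    return features.mean(axis=0)

def fuse_soft_attention(features: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Soft attention fusion (claims 6-7): a softmax over per-frame
    scores yields one fusion weight per feature representation."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ features  # weighted sum over frames

def retrieve(frames, detect, quality_score, extract_features, search_products):
    """Per-frame detection and quality gating, then fusion and lookup."""
    kept = []
    for frame in frames:
        box = detect(frame)                    # detector module: bounding box
        q = quality_score(frame, box)          # extractor module: quality score
        if q > QUALITY_THRESHOLD:              # gate on the quality threshold
            kept.append((extract_features(frame, box), q))
    if not kept:
        return None
    feats = np.stack([f for f, _ in kept])
    scores = np.array([q for _, q in kept])
    fused = fuse_soft_attention(feats, scores)  # or fuse_average(feats)
    return search_products(fused)               # nearest product in the database
```

A hard attention variant (claim 8) would instead select a subset of frames, for example via a learned policy, rather than weighting all of them.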